1. 11 May, 2021 1 commit
    • Min Xu's avatar
      [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      
      
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - much thanks for @cyanguwa for report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and ensure we detect
        issue with overlapping and cpu/gpu blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: default avatarMin Xu <min.xu@acm.org>
      Co-authored-by: default avatarMyle Ott <myleott@fb.com>
      8a42a8e3
  2. 08 May, 2021 1 commit
  3. 07 May, 2021 1 commit
    • msbaines's avatar
      [feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046
      msbaines authored
      * [feat] experimental.nn.SyncBatchNorm: initial commit
      
      Fast/simple re-implementation of SyncBatchNorm.
      
      When profiling SSL Vision, I was seeing a majority of cycles spent in
      SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
      model I was profiling.
      
      When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
      I get a 6x speedup:
      
      <class 'torch.nn.modules.batchnorm.BatchNorm2d'>
      Elapsed time is  0.08709120750427246
      Elapsed time is  0.12632274627685547
      Elapsed time is  0.14095258712768555
      Elapsed time is  0.16529417037963867
      Elapsed time is  0.1419970989227295
      Elapsed time is  0.15166854858398438
      Elapsed time is  0.12000870704650879
      Elapsed time is  0.17534875869750977
      <class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
      Elapsed time is  2.5087168216705322
      Elapsed time is  2.497001886367798
      Elapsed time is  2.5204885005950928
      Elapsed time is  2.526789903640747
      Elapsed time is  2.5080230236053467
      Elapsed time is  2.524489641189575
      Elapsed time is  2.513214588165283
      Elapsed time is  2.5359973907470703
      <class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
      Elapsed time is  0.4126114845275879
      Elapsed time is  0.39051294326782227
      Elapsed time is  0.40685415267944336
      Elapsed time is  0.4159870147705078
      Elapsed time is  0.42383885383605957
      Elapsed time is  0.4080159664154053
      Elapsed time is  0.41202712059020996
      Elapsed time is  0.42400121688842773
      f0a40046
  4. 05 May, 2021 1 commit
    • Min Xu's avatar
      [fix] add clear_autocast_cache flag (#650) · 861b5ce2
      Min Xu authored
      
      
      * [fix] add clear_autocast_cache flag
      
      - when training in AMP model with weight dtype32, FSDP may need to
        optionally clear the autocast cache to avoid GPU OOM
      - this flag is default false, automatically doing it is a future TODO
      - also added a verbose flag to make print(fsdp_model) a bit shorter
      - updated the memory test to cover those new code
      - added a couple of useful functions in parallel.py and testing.py
      
      * minor
      
      * address comments
      
      * format
      
      * improve the test
      Co-authored-by: default avatarMin Xu <min.xu@acm.org>
      861b5ce2
  5. 26 Apr, 2021 1 commit
  6. 07 Apr, 2021 1 commit
  7. 29 Mar, 2021 1 commit
  8. 28 Mar, 2021 1 commit
  9. 26 Mar, 2021 1 commit
  10. 19 Mar, 2021 1 commit
  11. 04 Mar, 2021 2 commits
    • Min Xu's avatar
      [feat]: checkpoint and normalization (#457) · 5e64d6a7
      Min Xu authored
      * [feat]: checkpoint and normalization
      
      - added special handling of BN for track_running_stats and checkpointing
      - we test BN/LN and checkpointing
      - we test them with mixed precision
      5e64d6a7
    • Min Xu's avatar
      [test] AdaScale & SDP/FSDP (#468) · efed9cee
      Min Xu authored
      - cover them in terms of code path only
      - numerically, AdaScale is different on SDP/FSDP than DDP, mainly
        due to partial view of the gradients.
      - this doesn't mean it is definitely not useful but it is yet to
        be validated.
      - not going to spend too much time until we have a real use case.
      efed9cee
  12. 02 Mar, 2021 1 commit
    • Sean Naren's avatar
      [feat] Add context manager to FSDP for easier child module wrapping (#446) · f3359550
      Sean Naren authored
      This adds a context manager that assists in making child modules with similar defaults.
      Usage:
      ```
      from fairscale.nn.misc import enable_wrap, wrap
      
      with enable_wrap(**handleful_of_important_params):
          layer_1 = wrap(torch.nn.Linear(5, 5))
          layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True) # Override parameters if you'd like
      
      # without the context manager, creates Linear layer
      layer_1 = wrap(torch.nn.Linear(5, 5))
      ```
      If not within the FSDP context, this would be a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.
      f3359550
  13. 26 Feb, 2021 1 commit
  14. 23 Feb, 2021 1 commit
    • Myle Ott's avatar
      Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336
      
      ) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
      Co-authored-by: default avatarMin Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: default avatarSam Shleifer <sshleifer@gmail.com>
      15512d9e
  15. 12 Feb, 2021 1 commit
  16. 10 Feb, 2021 1 commit
  17. 27 Jan, 2021 1 commit
  18. 21 Jan, 2021 1 commit
  19. 11 Jan, 2021 1 commit
  20. 08 Jan, 2021 2 commits
  21. 16 Dec, 2020 1 commit
    • Min Xu's avatar
      [feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838
      Min Xu authored
      * [doc]: AdaScale example and notes
      
      * formatted notes correctly as suggested by Benjamin
      
      * added feature and unit test to make sure lr_scheduler works
      
      * update the example with lr_scheduler
      
      * fixed doc with "make html"
      
      * addressed Mike's suggestions
      d65cd838
  22. 01 Dec, 2020 1 commit
  23. 21 Nov, 2020 1 commit
    • Benjamin Lefaudeux's avatar
      [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speed still likely on the table since the speed vs. bucketing does not match expectations, could be a follow up
      ad933b34
  24. 18 Nov, 2020 1 commit
  25. 16 Nov, 2020 1 commit
  26. 11 Nov, 2020 1 commit
  27. 10 Nov, 2020 1 commit
    • Tom Birch's avatar
      Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncScheudle, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
      5d4f50fb
  28. 28 Oct, 2020 1 commit
  29. 23 Oct, 2020 1 commit
  30. 21 Oct, 2020 1 commit
    • Min Xu's avatar
      [fix] fixing adascale all_reduce (#155) · 6802ad49
      Min Xu authored
      - Aurick noticed this bug and I ran into it yesterday
      - after the fix, our cifar training shows same gain values from
        different replics now:
      
      ```
      20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
      20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
      20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
      20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
      20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
      20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
      20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
      20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
      20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
      20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
      ```
      6802ad49
  31. 20 Oct, 2020 2 commits
  32. 14 Oct, 2020 1 commit
  33. 02 Oct, 2020 1 commit
  34. 17 Sep, 2020 1 commit
    • Tom Birch's avatar
      Multi-process pipe (#90) · 63f7796a
      Tom Birch authored
      Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
      * Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
      * Added support for lazy construction of modules (see lazy_construction for an example)
      * Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
      * Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
      63f7796a
  35. 16 Sep, 2020 1 commit
  36. 03 Sep, 2020 1 commit
    • Jun Ru Anderson's avatar
      Add grad scaler (#48) · b6a5e634
      Jun Ru Anderson authored
      
      
      Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require gradient scaling in order to converge.
      Co-authored-by: default avatarJun Ru Anderson <andersonic@fb.com>
      b6a5e634
  37. 27 Aug, 2020 1 commit