- 14 May, 2021 1 commit
-
-
Min Xu authored
* [minor] use dist.group.WORLD for default process group - this is slightly more efficient than the previous commit for get_process_group_cached. * fix * better fix * fixed for pytorch 1.6 and 1.7 * Update fairscale/utils/parallel.py Co-authored-by:
Min Xu <min.xu@acm.org> Co-authored-by:
Min Xu <min.xu.public@gmail.com>
-
- 07 May, 2021 1 commit
-
-
msbaines authored
* [feat] experimental.nn.SyncBatchNorm: initial commit Fast/simple re-implementation of SyncBatchNorm. When profiling SSL Vision, I was seeing a majority of cycles spent in SyncBatchNorm. With this change, I see a 10% to 20% speedup on the model I was profiling. When running benchmarks/experimental/sync_batchnorm.py on 8 x V100, I get a 6x speedup: <class 'torch.nn.modules.batchnorm.BatchNorm2d'> Elapsed time is 0.08709120750427246 Elapsed time is 0.12632274627685547 Elapsed time is 0.14095258712768555 Elapsed time is 0.16529417037963867 Elapsed time is 0.1419970989227295 Elapsed time is 0.15166854858398438 Elapsed time is 0.12000870704650879 Elapsed time is 0.17534875869750977 <class 'torch.nn.modules.batchnorm.SyncBatchNorm'> Elapsed time is 2.5087168216705322 Elapsed time is 2.497001886367798 Elapsed time is 2.5204885005950928 Elapsed time is 2.526789903640747 Elapsed time is 2.5080230236053467 Elapsed time is 2.524489641189575 Elapsed time is 2.513214588165283 Elapsed time is 2.5359973907470703 <class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'> Elapsed time is 0.4126114845275879 Elapsed time is 0.39051294326782227 Elapsed time is 0.40685415267944336 Elapsed time is 0.4159870147705078 Elapsed time is 0.42383885383605957 Elapsed time is 0.4080159664154053 Elapsed time is 0.41202712059020996 Elapsed time is 0.42400121688842773
-
- 28 Mar, 2021 1 commit
-
-
msbaines authored
-
- 19 Mar, 2021 1 commit
-
-
msbaines authored
-
- 23 Feb, 2021 1 commit
-
-
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336 ) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper. Compared to PyTorch DDP: * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2 * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3: * all-gather parameters at start of forward pass and start of backward pass * reduce-scatter grads at end of backward pass Co-authored-by:
Min Xu <24926999+min-xu-ai@users.noreply.github.com> Co-authored-by:
Sam Shleifer <sshleifer@gmail.com>
-
- 11 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* tentatively fixing the cpu version of circleci jobs, now pipe tests are the last ones standing * fixing oss backcompat, trying to fix rpc in old pytorch also * fixing the file based init in torch 1.5
-
- 08 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 01 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
-
- 21 Nov, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* rewrite using autograd and Variable execution queue to make the reduce automatic * share buckets with OSS to remove duplication * some speed still likely on the table since the speed vs. bucketing does not match expectations, could be a follow up
-
- 10 Nov, 2020 1 commit
-
-
Tom Birch authored
Adds support for: * Reused layers (e.g. for weight sharing) * Lazily-constructed layers * Single-process control via PipeRPCWrapper * PipelineStyle.AsyncScheudle, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive Also added examples for multi-process and PipeRPCWrapper
-
- 28 Oct, 2020 1 commit
-
-
msbaines authored
-
- 14 Oct, 2020 1 commit
-
-
msbaines authored
-
- 17 Sep, 2020 1 commit
-
-
Tom Birch authored
Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines) * Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess * Added support for lazy construction of modules (see lazy_construction for an example) * Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv * Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
-
- 31 Jul, 2020 1 commit
-
-
Tom Birch authored
-
- 08 Jul, 2020 1 commit
-
-
Mandeep Singh Baines authored
-