- 26 Feb, 2021 3 commits
- 25 Feb, 2021 2 commits
-
-
Benjamin Lefaudeux authored
* bring back a fix from FSDP, may help a few existing users
-
Min Xu authored
-
- 24 Feb, 2021 1 commit
-
-
Myle Ott authored
-
- 23 Feb, 2021 4 commits
-
-
Min Xu authored
* [test]: add peak mem in checkpoint test
* more debugging
* new test
* more fix
* better collection of debug in case of future failures
* update the comment
* typo
* comment
* clarify
* better wording
-
Benjamin Lefaudeux authored
* POC, testing against the DDP comm hook when available
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
-
Min Xu authored
* [bug]: not all CUDA memory is freed when model is deleted
* fixed memory leak - without this, peak memory will be high when more than one model is trained (i.e. the first model leaves stuff around, pushing up the peak memory when the second model runs)
* addressed comments
* fix
* changelog
-
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.

Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
    * all-gather parameters at start of forward pass and start of backward pass
    * reduce-scatter grads at end of backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
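
The communication comparison in the commit message above can be checked with a little arithmetic. The sketch below is illustrative only and is not part of fairscale: the function names `ddp_comm_volume` and `fsdp_comm_volume` are hypothetical, and it assumes that an all-reduce costs roughly a reduce-scatter plus an all-gather, each moving about one model's worth of elements per worker (the `(N-1)/N` factor for `N` workers is ignored).

```python
def ddp_comm_volume(num_params: int) -> int:
    """Approximate per-worker elements communicated per step by DDP.

    Assumption: the gradient all-reduce is modeled as a reduce-scatter
    followed by an all-gather, each moving ~num_params elements.
    """
    reduce_scatter = num_params
    all_gather = num_params
    return reduce_scatter + all_gather


def fsdp_comm_volume(num_params: int, reshard_after_forward: bool) -> int:
    """Approximate per-worker elements communicated per step by FSDP.

    Hypothetical helper mirroring the bullets in the commit message:
    parameters are all-gathered at the start of the forward pass, gathered
    again at the start of the backward pass only if they were resharded
    after forward, and gradients are reduce-scattered after backward.
    """
    all_gather_forward = num_params
    all_gather_backward = num_params if reshard_after_forward else 0
    reduce_scatter_grads = num_params
    return all_gather_forward + all_gather_backward + reduce_scatter_grads


if __name__ == "__main__":
    n = 1_000_000  # illustrative parameter count
    base = ddp_comm_volume(n)
    for flag in (False, True):
        ratio = fsdp_comm_volume(n, reshard_after_forward=flag) / base
        print(f"reshard_after_forward={flag}: {ratio:.2f}x DDP")
```

Under these assumptions, `reshard_after_forward=False` matches DDP's volume, while `reshard_after_forward=True` adds one extra all-gather, which is where the 50% increase comes from.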
-
- 19 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* test with and without buckets for all the shardedDDP unit tests
* parametrize all the things
* refactoring, adding even more combinations at times
* handle hosts not having cuda
-
- 18 Feb, 2021 2 commits
-
-
Benjamin Lefaudeux authored
* Adding multiple groups support to ShardedDDP + unit test
* adding gloo to the backends tested for multiple groups
-
Benjamin Lefaudeux authored
* [fix] ShardedDDP train/eval modes
* Update CHANGELOG.md
-
- 17 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* initial implementation, with unit test and assert
* added changelog and better debug string
-
- 12 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests
-
- 10 Feb, 2021 1 commit
-
-
Myle Ott authored
* Add fairscale.utils.containers
  Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
* Add fairscale.nn.misc.checkpoint_activations
  Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
-
- 09 Feb, 2021 1 commit
-
-
msbaines authored
-
- 04 Feb, 2021 4 commits
-
-
msbaines authored
-
Benjamin Lefaudeux authored
* Adding a proper ddp parity / AMP unit test, overdue
* catch non-AMP pytorch
-
msbaines authored
-
msbaines authored
-
- 03 Feb, 2021 2 commits
-
-
msbaines authored
-
Benjamin Lefaudeux authored
* adding the .to(device) support + unit testing
* doc update
-
- 02 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* no idea about the root issue, but it proved to be fairly narrow (gloo + cpu + python3.8 + no cuda installed), so I guess that's out of scope for fairscale
-
- 30 Jan, 2021 1 commit
-
-
msbaines authored
-
- 29 Jan, 2021 1 commit
-
-
msbaines authored
-
- 27 Jan, 2021 1 commit
-
-
msbaines authored
-
- 23 Jan, 2021 1 commit
-
-
Siddharth Goyal authored
* Add AMPnet implementation (clean version)
* Move ampnet to experimental
* Move stuff around pipeline
* Address review comments and fix pre-commit errors
* Refactor and modify delegate functionality
* Modify header in pipe.py
-
- 21 Jan, 2021 3 commits
-
-
Benjamin Lefaudeux authored
* working around broken mypy
-
Myle Ott authored
-
Myle Ott authored
-
- 15 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* minor, but ease of life, one less papercut
-
- 11 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* tentatively fixing the cpu version of circleci jobs, now pipe tests are the last ones standing
* fixing oss backcompat, trying to fix rpc in old pytorch also
* fixing the file based init in torch 1.5
-
- 05 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests
-
- 02 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* fix typo, backend for CPU test
-
- 30 Dec, 2020 1 commit
-
-
Sean Naren authored
* Add function to add handle for sync BN
* Add test to ensure batch norm handles have been added
-
- 29 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* catching properly a given test failing if not enough gpus
-
- 28 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* file based dist init
* nicer handling of broken world sizes vs. number of available GPUs, do not break but warn out
-
- 19 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
[OSS] Getting rid of the "should bucket" hash table, just use a list + non trainable params fix (#259)
* Getting rid of the "should bucket" hash table, just use a list
  Properly handle all params, with or without requires_grad
* make sure that this case is unit tested
-
- 10 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* unit test checking ddp and sharded_ddp equivalence, reproducing the issue that Sean spotted
* fixing the issue, not counting requests in flight properly
* adding a multiple optimizers case
-