- 27 Apr, 2021 (1 commit)
  - Benjamin Lefaudeux authored
- 21 Apr, 2021 (1 commit)
  - Benjamin Lefaudeux authored
- 06 Apr, 2021 (1 commit)
  - Benjamin Lefaudeux authored
- 05 Apr, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * making APIs more private
    * linting
- 04 Apr, 2021 (1 commit)
  - Benjamin Lefaudeux authored
- 19 Mar, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * param buckets
    * unifying the buckets
- 17 Mar, 2021 (1 commit)
  - Benjamin Lefaudeux authored
- 15 Mar, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * extending the current state_dict interface: make it possible to do everything in a single call, and to checkpoint on all ranks (see the sketch below)
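The 15 Mar change above touches sharded-optimizer checkpointing. A minimal sketch of the OSS checkpoint flow it extends, assuming `torch.distributed` is already initialized and using the known `consolidate_state_dict()` / `state_dict()` pair; the exact single-call, all-rank interface added by the commit may differ.

```python
import torch
import torch.distributed as dist
from fairscale.optim import OSS

model = torch.nn.Linear(32, 32)
# OSS wraps a regular optimizer and shards its state across ranks.
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)

# ... training steps ...

# Gather the sharded optimizer state before checkpointing. The commit above
# extends this interface so that consolidation and checkpointing can happen
# in a single call, and on every rank.
optimizer.consolidate_state_dict()  # shards are collected on rank 0 by default
if dist.get_rank() == 0:
    torch.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict()},
        "checkpoint.pt",
    )
```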
- 11 Mar, 2021 (2 commits)
  - Benjamin Lefaudeux authored
    * Adding a hard sync barrier before the broadcast; mostly useful for Gloo, since NCCL is synced behind the scenes (see the sketch below)
    * adding a proper unit test
    * adding a unit test for https://github.com/facebookresearch/fairscale/pull/510
  - Benjamin Lefaudeux authored
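A generic `torch.distributed` illustration of the barrier-before-broadcast pattern from the first commit above; this is not the fairscale internal code, just the idea.

```python
import torch
import torch.distributed as dist

def broadcast_with_barrier(tensor: torch.Tensor, src: int = 0) -> torch.Tensor:
    # Make sure every rank has reached this point before shipping data.
    # NCCL tends to serialize this on the stream anyway; Gloo benefits
    # from the explicit barrier.
    dist.barrier()
    dist.broadcast(tensor, src=src)  # reference copy comes from `src`
    return tensor
```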
- 09 Mar, 2021 (2 commits)
  - Benjamin Lefaudeux authored
    * seemingly fix flakiness for Gloo by checking all the communication handles (see the sketch below)
  - Benjamin Lefaudeux authored
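A hedged sketch of the "check all communication handles" idea from the first commit above: keep every async work handle and wait on all of them, rather than only the last one. The function name and structure are illustrative, not the ShardedDDP internals.

```python
import torch.distributed as dist

def broadcast_all_params(params, src: int = 0) -> None:
    # Launch every broadcast asynchronously and keep the work handles.
    handles = [dist.broadcast(p.data, src=src, async_op=True) for p in params]
    # Wait on *all* of them; Gloo in particular needs every handle completed.
    for handle in handles:
        handle.wait()
```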
- 05 Mar, 2021 (2 commits)
  - Benjamin Lefaudeux authored
    * [perf][minor] cache the rank lookups, small ShardedDDP perf fix
    * tiny improvement, code quality
  - Benjamin Lefaudeux authored
    * change empty shard handling for OSS, do not rely on asserts
    * code review
- 23 Feb, 2021 (1 commit)
  - Myle Ott authored
    Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper. Compared to PyTorch DDP:
    * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
    * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
    * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
      * all-gather parameters at the start of the forward pass and the start of the backward pass
      * reduce-scatter grads at the end of the backward pass
    Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
    Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
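A minimal usage sketch of the FSDP wrapper described above, assuming `torch.distributed` is already initialized with one GPU per rank; the `reshard_after_forward` flag comes straight from the commit message, everything else here is illustrative.

```python
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).cuda()

# reshard_after_forward=True trades ~50% extra communication for ZeRO-3-style
# memory savings; False keeps communication on par with DDP (ZeRO-2-like).
model = FSDP(model, reshard_after_forward=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 512, device="cuda")
loss = model(x).sum()
loss.backward()   # grads are reduce-scattered at the end of the backward pass
optimizer.step()  # each rank only holds and updates its own parameter shard
```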
- 22 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * adding an assert + corresponding unit test
    * updated changelog
    * adjusting the adascale tests
- 19 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * test with and without buckets for all the ShardedDDP unit tests
    * parametrize all the things
    * refactoring, adding even more combinations at times
    * handle hosts not having CUDA
- 14 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * WIP, needs to be fixed!
    * should be a fix, many thanks Weiyi Zheng
    * slightly better unit test, sorting the states on the way out
    * reproducing the issue from Weiyi in a unit test, and finally properly fixing it
    * fixing the unit test on pytorch 1.5 (original loss diff: 26.404895782470703 vs 26.404342651367188)
- 12 Feb, 2021 (3 commits)
  - Benjamin Lefaudeux authored
    This reverts commit 8be9d930.
  - Benjamin Lefaudeux authored
    * many thanks Weiyi Zheng
  - Benjamin Lefaudeux authored
    * Better unit testing
    * Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time (see the sketch below)
    * Enabling accumulation tests
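A hedged sketch of the "refresh the DDP assumptions" flow from the last commit above, using `ShardedDataParallel` together with `OSS`; the `refresh_trainable()` method name reflects the fairscale API of that period and should be treated as an assumption.

```python
import torch
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim import OSS

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 4))
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)
ddp_model = ShardedDDP(model, optimizer)

# ... train for a while ...

# Freeze part of the model, then tell ShardedDDP that its trainability
# assumptions are stale. Doing this explicitly, rather than re-checking on
# every step, is the "optional, saves some time" part of the commit message.
for p in model[0].parameters():
    p.requires_grad = False
ddp_model.refresh_trainable()
```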
- 08 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * flat params all along, way simpler (see the sketch below)
    * updating the docstring
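A generic illustration of the "flat params" idea from the commit above: pack several same-dtype tensors into one contiguous buffer and keep views into it, so a single collective can cover the whole bucket. This is not the fairscale implementation.

```python
import torch

def flatten_into_bucket(tensors):
    # One contiguous buffer holding every tensor, back to back.
    bucket = torch.cat([t.detach().reshape(-1) for t in tensors])
    # Views into the buffer, shaped like the original tensors, so callers can
    # keep reading/writing per-tensor while communicating the flat buffer.
    views, offset = [], 0
    for t in tensors:
        views.append(bucket[offset : offset + t.numel()].view_as(t))
        offset += t.numel()
    return bucket, views
```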
- 05 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    fix a broken earlier commit, which only worked for the first step
- 04 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    cache this iterator, an easy speed-up
- 02 Feb, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * adding a test to prove the interoperability with upstream pytorch
    * updating the changelog
    * eager state pruning
    * pytorch 1.5 compat
- 27 Jan, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * removing the torch util altogether, broken on 1.7.1
- 26 Jan, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * fix for torch.distributed broadcast failing on a dummy object
- 21 Jan, 2021 (1 commit)
  - Benjamin Lefaudeux authored
    * Couple of small improvements, no logic changes
- 20 Jan, 2021 (1 commit)
  - Benjamin Lefaudeux authored
- 11 Jan, 2021 (2 commits)
  - Benjamin Lefaudeux authored
    * tentatively fixing the CPU version of the CircleCI jobs, now the pipe tests are the last ones standing
    * fixing OSS backcompat, trying to fix RPC in old pytorch also
    * fixing the file-based init in torch 1.5
  - Benjamin Lefaudeux authored
    * min bucket size with the model size
    * resize the bucket after all the params have been squeezed in, to save a tiny bit of memory (see the sketch below)
    * minor, ensure that the cache is freed and improve the comments
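A hedged sketch of the bucket-sizing idea from the second commit above: cap the bucket by the actual model size, fill it greedily, then shrink it to what was actually used. All names here are made up for the example; the real OSS bucketing logic differs.

```python
import torch

def build_broadcast_bucket(params, max_bucket_mb: float = 16.0) -> torch.Tensor:
    # Never allocate a bucket larger than the model itself.
    model_numel = sum(p.numel() for p in params)
    cap = min(int(max_bucket_mb * 2**20 / 4), model_numel)  # fp32 elements
    bucket = torch.zeros(cap, dtype=torch.float32)

    offset = 0
    for p in params:
        if offset + p.numel() > cap:
            break  # params that do not fit are communicated individually
        bucket[offset : offset + p.numel()].copy_(p.detach().reshape(-1))
        offset += p.numel()

    # Resize down to what was actually filled, saving a bit of memory.
    return bucket[:offset].clone()
```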
- 08 Jan, 2021 (3 commits)
  - Benjamin Lefaudeux authored
  - Benjamin Lefaudeux authored
  - Joshua Meier authored
    * add additional unit test
    * support model parallelism in OSS
- 30 Dec, 2020 (1 commit)
  - Benjamin Lefaudeux authored
    * removing a dead call since ShardedDDP, small speedup
    * unrelated, but filling in the changelog
    * another nit
- 22 Dec, 2020 (1 commit)
  - Benjamin Lefaudeux authored
    * fix, one liner
    * adjust so that frozen trunks still get spread, even if this should have little consequence
    * removing dead code, hopeful unit test fix
    * now with some linting
    * adding a proper unit test case
- 19 Dec, 2020 (1 commit)
  - Benjamin Lefaudeux authored
    [OSS] Getting rid of the "should bucket" hash table, just use a list + non-trainable params fix (#259)
    * Getting rid of the "should bucket" hash table, just use a list
    * Properly handle all params, with or without requires_grad
    * make sure that this case is unit tested
- 17 Dec, 2020 (2 commits)
  - Joshua Meier authored
  - Benjamin Lefaudeux authored
    * typo, sorry about that
    * small perf fix
- 16 Dec, 2020 (1 commit)
  - Benjamin Lefaudeux authored
    * Better handling of the callback queue, try to consume it as we go (see the sketch below)
    * dumping buckets for the reduce part, always the same unused params issue
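A generic sketch of "consume the callback queue as we go" from the commit above: completed async reduce handles are popped and their callbacks run opportunistically, instead of only being drained at the end of the backward pass. The class and method names are illustrative, not the ShardedDDP internals.

```python
from collections import deque
import torch.distributed as dist

class ReduceQueue:
    def __init__(self):
        self._queue = deque()  # (async work handle, callback) pairs, FIFO

    def push(self, tensor, callback):
        # Launch the reduction asynchronously and remember what to do with it.
        handle = dist.all_reduce(tensor, async_op=True)
        self._queue.append((handle, callback))
        self._consume_ready()  # opportunistically drain finished work

    def _consume_ready(self):
        # Pop and run callbacks for work that has already completed.
        while self._queue and self._queue[0][0].is_completed():
            _, callback = self._queue.popleft()
            callback()

    def flush(self):
        # Final drain, e.g. at the end of the backward pass.
        while self._queue:
            handle, callback = self._queue.popleft()
            handle.wait()
            callback()
```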