- 20 Mar, 2021 1 commit
Myle Ott authored
* Add new test for weight init (currently failing)
* Set FSDP.compute_device so summon_full_params works before the module moves to CUDA
* Override FSDP.apply to enable custom weight init
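A minimal sketch of what this change enables (not taken from the commit itself; the init function and model are illustrative, and an initialized torch.distributed process group is assumed): custom weight initialization applied through FSDP.apply, which per this change gathers the full parameters before running the function.

```python
# Illustrative sketch only: custom weight init through FSDP.apply.
# Assumes torch.distributed is already initialized (one process per GPU).
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

def init_weights(module):
    # Custom initializer; only touches Linear layers here.
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)
        torch.nn.init.zeros_(module.bias)

model = FSDP(torch.nn.Linear(8, 8))
model.apply(init_weights)  # per this change, runs with the full (unsharded) parameters gathered
```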
-
- 19 Mar, 2021 2 commits
Benjamin Lefaudeux authored
* param buckets
* unifying the buckets
-
msbaines authored
-
- 18 Mar, 2021 4 commits
Benjamin Lefaudeux authored
* extracting the buckets into a dedicated class, fixing the resize_ bug
* adding a unit test
* copyright
-
Min Xu authored
* [feat] FSDP: add auto_wrap_bn - a utility function to handle wrapping of BN layers
* changelog
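A hedged sketch of how the new utility might be used (the import path and signature are assumptions based on later fairscale releases and may differ at this commit): auto_wrap_bn gives each BatchNorm layer its own FSDP wrapper so it can keep full-precision, unflattened parameters while the rest of the model is sharded normally.

```python
# Sketch only; import path and exact signature are assumptions, not taken from the commit.
# Assumes torch.distributed is already initialized.
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.data_parallel import auto_wrap_bn  # assumed location

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
)
model = auto_wrap_bn(model)                 # BN layers get their own full-precision FSDP wrappers
model = FSDP(model, mixed_precision=True)   # the rest of the model can use mixed precision
```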
-
Min Xu authored
* [feature] FSDP: enable PyTorch SyncBN - not fully validated yet, but at least it no longer asserts; this enables VISSL to move forward with its next PR
* add the test file
* changelog and lint
* addressed comment
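For context, a minimal sketch of the pattern this enables (illustrative; the SyncBatchNorm conversion is standard PyTorch API, the model and layer choices are made up): convert BatchNorm layers to SyncBatchNorm before sharding the model.

```python
# Sketch only: SyncBN under a sharded model. Assumes an initialized process group
# and one GPU per rank.
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()

# Standard PyTorch conversion: replaces BatchNorm*d layers with SyncBatchNorm.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = FSDP(model)  # per this commit, SyncBN no longer asserts under FSDP
```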
-
Benjamin Lefaudeux authored
-
- 17 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* Deactivating buckets when running on a single rank - no longer crashing, though buckets are not useful in that case
-
- 12 Mar, 2021 2 commits
Min Xu authored
* FSDP: multi-pass autograd graph and mixed precision
  - added BACKWARD_PRE/POST checking
  - better assert_state
  - fixed issue of backward hook misfiring
* fix
* cleanup
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Co-authored-by: Myle Ott <myleott@fb.com>
-
msbaines authored
-
- 11 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* Adding a hard sync barrier before the broadcast - mostly useful for Gloo; NCCL is synced behind the scenes
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510
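A minimal sketch of the pattern the first bullet describes (the helper name and tensor are illustrative, not from the commit): an explicit barrier guarantees every rank has reached the broadcast point, which matters for the Gloo backend.

```python
# Illustrative sketch of "barrier before broadcast"; not the commit's actual code.
import torch
import torch.distributed as dist

def synced_broadcast(tensor: torch.Tensor, src: int = 0) -> None:
    dist.barrier()               # hard sync point: all ranks arrive here first
    dist.broadcast(tensor, src)  # then every rank takes part in the broadcast
```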
-
- 09 Mar, 2021 2 commits
- 08 Mar, 2021 1 commit
Min Xu authored
* [fix]: handle inputs with containers
  - this is an issue surfaced by VISSL as well
  - the fix seems to be super simple
  - also cleaned up two tests with respect to multiple such tests running back to back (they don't do that presently)
* cleanup
* fix
* lint
-
- 06 Mar, 2021 1 commit
Myle Ott authored
-
- 05 Mar, 2021 2 commits
Min Xu authored
* [refactor] enhance wrap and auto_wrap
  Two things were done in this PR:
  1. We don't need to import FSDP in wrap.py since the wrapper class type is stored in the context now.
  2. We can use an `auto_wrap_policy` function to customize the wrapping policy for auto_wrap, including module size, blacklist, and exclude list.
  The auto_wrap function got simplified a bit as a minor side effect.
* Update fairscale/nn/wrap/auto_wrap.py
* addressed comments
* addressed more comments

Co-authored-by: Sean Naren <sean@grid.ai>
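A hedged sketch of the customization point described above (the policy signature, the `wrapper_cls` and `auto_wrap_policy` parameter names, and the import paths are assumptions for this fairscale version; the policy itself is illustrative, not fairscale's default): a user-supplied `auto_wrap_policy` callable decides, per module, whether auto_wrap should wrap it.

```python
# Sketch only; exact signatures may differ from this fairscale version.
# Assumes torch.distributed is already initialized.
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, auto_wrap

def size_based_policy(module, recurse, unwrapped_params, min_num_params=1_000_000):
    if recurse:
        return True  # keep descending into children
    # wrap this module if it still carries enough unwrapped parameters
    return unwrapped_params >= min_num_params

big_model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
with enable_wrap(wrapper_cls=FSDP):  # the wrapper class now lives in the context
    sharded = auto_wrap(big_model, auto_wrap_policy=size_based_policy)
```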
-
Benjamin Lefaudeux authored
* [perf][minor] cache the rank lookups, a small ShardedDDP perf fix
* tiny improvement, code quality
-
- 04 Mar, 2021 2 commits
Min Xu authored
* [feat]: checkpoint and normalization
  - added special handling of BN for track_running_stats and checkpointing
  - we test BN/LN and checkpointing
  - we test them with mixed precision
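For context, a hedged sketch of the combination being tested (the checkpoint_wrapper import path is an assumption and may differ at this commit): activation checkpointing wrapped around a block that contains BatchNorm, whose running statistics need the special handling this commit adds.

```python
# Sketch only; the import path is an assumption based on fairscale's public API.
import torch
from fairscale.nn.misc import checkpoint_wrapper  # assumed location

block = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.BatchNorm1d(64),   # track_running_stats needs care when forward is re-run
    torch.nn.ReLU(),
)
block = checkpoint_wrapper(block)   # activations are recomputed during backward
out = block(torch.randn(8, 64)).sum()
out.backward()
```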
-
Sam Shleifer authored
-
- 03 Mar, 2021 2 commits
- 02 Mar, 2021 2 commits
Myle Ott authored
-
Sean Naren authored
This adds a context manager that assists in wrapping child modules with similar defaults. Usage:
```
from fairscale.nn.misc import enable_wrap, wrap

with enable_wrap(**handleful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)  # Override parameters if you'd like

# without the context manager, creates a plain Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))
```
If not within the FSDP context, wrap() is a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.
-
- 01 Mar, 2021 2 commits
Min Xu authored
* [chores]: CI py39 on GPU and more efficiency
* add test list files
* fix
* add test list files
* split benchmark run into 2 runs
* fix 1.8 version and balance benchmarks
* fix
* fix
* fix
* fix
* recording tests
* py39 install fix
* test again
* move tests
* reorg tests
* skip tests for torch 1.8 due to an upstream bug
* removed __init__.py from tests since it confuses pytest
* Revert "removed __init__.py from tests since it confuses pytest"
  This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
* don't include __init__ in file list
* notes on __init__.py and added missing ones
* fixed mypy in a test file
* balance test runtime
* better pip install
* balance more
* pip fix
* balance
* balance more, all tests should finish within 20m now
* minor license update
* trying cu102
* more doc and addressed Ben's comments
* debugging
* debugging...
-
Min Xu authored
* [test] FSDP: add the failing test for #421
* skip on 1.5
* better skipping
* Update tests/nn/data_parallel/test_fsdp_grad_scaler.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
-
- 27 Feb, 2021 1 commit
Min Xu authored
* [fix] FSDP corner case where all params are in the children
* lint
* fix
* tradeoff
* fix doc build
* review comments
-
- 26 Feb, 2021 3 commits
- 25 Feb, 2021 2 commits
Benjamin Lefaudeux authored
* bring back a fix from FSDP, may help a few existing users
-
Min Xu authored
-
- 24 Feb, 2021 1 commit
Myle Ott authored
-
- 23 Feb, 2021 4 commits
Min Xu authored
* [test]: add peak memory measurement to the checkpoint test
* more debugging
* new test
* more fixes
* better collection of debug info in case of future failures
* update the comment
* typo
* comment
* clarify
* better wording
-
Benjamin Lefaudeux authored
* POC, testing against the DDP comm hook when available
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
-
Min Xu authored
* [bug]: not all CUDA memory is freed when a model is deleted
* fixed the memory leak - without this, peak memory is high when more than one model is trained (i.e. the first model leaves stuff around, pushing up the peak memory when the second model runs)
* addressed comments
* fix
* changelog
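Not from the commit, but a small sketch using standard PyTorch CUDA memory APIs of how the scenario above can be checked: train one model, let it be deleted, reset the peak-memory counter, then confirm the second model's peak does not include leftovers from the first.

```python
# Sketch with tiny stand-in "models"; the point is the
# delete / empty_cache / reset_peak_memory_stats sequence. Requires a CUDA device.
import torch

def train_once() -> None:
    model = torch.nn.Linear(4096, 4096).cuda()
    model(torch.randn(64, 4096, device="cuda")).sum().backward()
    del model  # everything owned by this "model" should be freed here

train_once()                            # first model
torch.cuda.empty_cache()                # release cached blocks
torch.cuda.reset_peak_memory_stats()    # fresh peak counter for the second run
train_once()                            # second model
print(f"peak: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
```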
-
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper. Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
  * all-gather parameters at the start of the forward pass and the start of the backward pass
  * reduce-scatter grads at the end of the backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
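A minimal usage sketch to go with the description above (illustrative, not taken from the commit; assumes an initialized process group and one GPU per rank):

```python
# Sketch: FSDP as a drop-in replacement for DDP. Assumes torch.distributed
# is initialized and each rank owns one GPU.
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

model = torch.nn.Linear(1024, 1024).cuda()
model = FSDP(model, reshard_after_forward=True)  # ZeRO-3-like: re-gather params each pass

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(4, 1024, device="cuda")).sum()
loss.backward()    # gradients are reduce-scattered across ranks
optimizer.step()   # each rank updates only its own parameter shard
```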
-
- 19 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* test with and without buckets for all the ShardedDDP unit tests
* parametrize all the things
* refactoring, adding even more combinations at times
* handle hosts without CUDA
-
- 18 Feb, 2021 2 commits
Benjamin Lefaudeux authored
* Adding support for multiple process groups to ShardedDDP, plus a unit test
* adding Gloo to the backends tested for multiple groups
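A hedged sketch of what multiple-group support looks like from the caller's side (the `process_group`/`group` parameter names and the two-rank subgroup are assumptions for illustration, not taken from the commit):

```python
# Sketch only; parameter names may differ in this fairscale version.
# Assumes a global process group with at least two ranks is initialized.
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim import OSS

subgroup = dist.new_group(ranks=[0, 1])  # shard only across these ranks
model = torch.nn.Linear(32, 32).cuda()
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.01, group=subgroup)
model = ShardedDDP(model, optimizer, process_group=subgroup)
```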
-
Benjamin Lefaudeux authored
* [fix] ShardedDDP train/eval modes
* Update CHANGELOG.md
-
- 17 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* initial implementation, with unit test and assert
* added changelog and a better debug string
-