- 20 Apr, 2021 1 commit
Sam Shleifer authored
- 19 Apr, 2021 1 commit
Min Xu authored
* FSDP: fix training with frozen weights
  - an assert is changed to catch this case correctly
  - a unit test added (based on Quentin's test code) for this case, comparing DDP and FSDP
  - fixes #610
* added test file to list 1
* use better and simpler code as suggested by Myle
* testing both methods of freezing as well

Co-authored-by: Min Xu <min.xu@acm.org>
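A minimal sketch (not taken from the PR) of the scenario this fix targets: freezing part of a model before wrapping it in FSDP. The model and the freezing choices are placeholders, and it assumes `torch.distributed` has already been initialized.

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Placeholder model: a "trunk" we want frozen and a trainable "head".
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

# One common way to freeze: turn off gradients on the trunk before wrapping.
for p in model[0].parameters():
    p.requires_grad = False

fsdp_model = FSDP(model)  # assumes a process group is already set up

# Another common approach: give the optimizer only the trainable parameters.
optimizer = torch.optim.SGD(
    (p for p in fsdp_model.parameters() if p.requires_grad), lr=0.01
)
```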
- 14 Apr, 2021 1 commit
Myle Ott authored
- 13 Apr, 2021 1 commit
Sam Shleifer authored
- 08 Apr, 2021 1 commit
Sam Shleifer authored
- 07 Apr, 2021 1 commit
Myle Ott authored
- 04 Apr, 2021 1 commit
Sam Shleifer authored
- 03 Apr, 2021 1 commit
Shruti Bhosale authored
- 31 Mar, 2021 1 commit
Min Xu authored
[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556)

* [fix] disable single rank process group for auto_wrap_bn
  - beefed up the unit test with a regnet-like model
  - found that the single-rank process group was causing problems
  - disabled it to enable convergence tests on the vissl side
  - use `raise e from None` to get a better assertion output in testing.py
* [test] fix regnet test for DDP + mixed precision
  - need AMP context in FSDP
  - work around a difference between DDP & FSDP when bias=True
  - fixed a bug in input data generation that caused different ranks to have the same data with a wrong iteration count
  - added a TODO for a better loss and grad_scaler; reduced iters so there is no NaN
  - added (disabled) debugging code
* lint
* add scaler
* add a real loss
* seeding in the ranks
* balance tests
* run the AMP DDP==FSDP test only on CUDA version 11 and up
* add relu inplace and comment
* make wrap_bn cover more cases in full precision mode
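For orientation, the AMP pieces mentioned above (an autocast context plus a gradient scaler) look roughly like the following in plain PyTorch. The model, loss, and data are placeholders, and the exact scaler used in the test is not shown here.

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()  # standard AMP gradient scaler

def train_step(model, optimizer, batch, target):
    # Run the forward pass under the AMP autocast context,
    # then do a scaled backward to avoid FP16 gradient underflow.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = F.mse_loss(model(batch), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```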
- 25 Mar, 2021 1 commit
Sam Shleifer authored
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
- 20 Mar, 2021 1 commit
Myle Ott authored
* Add new test for weight init (fails)
* Set FSDP.compute_device so summon_full_params works before the module moves to CUDA
* Override FSDP.apply to enable custom weight init
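A sketch of the pattern this enables: calling `.apply()` with a custom init function on an FSDP-wrapped module so that the init runs over full, unsharded weights. The init function and model below are placeholders.

```python
import torch.nn as nn

def init_weights(module):
    # Custom init applied recursively by Module.apply / FSDP.apply.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# With an FSDP-wrapped model (construction not shown), the overridden apply
# lets the init function see the gathered parameters:
#   fsdp_model.apply(init_weights)
```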
- 18 Mar, 2021 3 commits
Myle Ott authored
Min Xu authored
* [feat] FSDP: add auto_wrap_bn
  - add a utility function to handle wrapping of BN
* changelog
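A hedged sketch of how the helper might be used; the import path and the exact call signature are assumptions based on the commit title, not taken from the diff.

```python
import torch.nn as nn
# Import path is an assumption; it may differ between fairscale versions.
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.data_parallel import auto_wrap_bn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Wrap the BatchNorm layers first (assumed signature: module in, module out),
# then wrap the whole model with FSDP as usual.
model = auto_wrap_bn(model)
fsdp_model = FSDP(model)  # assumes torch.distributed is already initialized
```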
Min Xu authored
* [feature] FSDP: enable PyTorch SyncBN
  - not fully validated yet, but at least not asserting
  - this enables VISSL to move forward with its next PR
* add the test file
* changelog and lint
* addressed comments
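For context, a minimal sketch of the standard PyTorch SyncBN conversion this enables ahead of FSDP wrapping; the model is a placeholder and the wrapping order is an assumption.

```python
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Convert BatchNorm*d layers to SyncBatchNorm so batch statistics are
# synchronized across ranks, then wrap with FSDP.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
fsdp_model = FSDP(model)  # assumes torch.distributed is already initialized
```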
- 17 Mar, 2021 1 commit
Min Xu authored
- 12 Mar, 2021 1 commit
Min Xu authored
* FSDP: multi-pass autograd graph and mixed precision
  - added BACKWARD_PRE/POST checking
  - better assert_state
  - fixed an issue of the backward hook misfiring
* fix
* cleanup
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Co-authored-by: Myle Ott <myleott@fb.com>
- 09 Mar, 2021 3 commits
Myle Ott authored
Myle Ott authored
Sam Shleifer authored
- 08 Mar, 2021 3 commits
Myle Ott authored
Sam Shleifer authored
* Document FSDP tips and tricks in a separate file
Min Xu authored
* [fix] handle inputs with containers
  - this is an issue surfaced by vissl as well
  - the fix seems to be super simple
  - also cleaned up two tests with respect to multiple such tests running back to back (they don't do that presently)
* cleanup
* fix
* lint
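A minimal, self-contained sketch of the kind of input the fix is about: a forward pass that takes a container (here a dict of tensors) rather than a plain tensor. The module is hypothetical; with FSDP the same call simply goes through the wrapper.

```python
import torch
import torch.nn as nn

class DictInputModel(nn.Module):
    """Hypothetical module whose forward takes a dict of tensors."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, batch):
        # The input is a container, not a bare tensor.
        return self.linear(batch["features"]) + batch["offset"]

model = DictInputModel()
out = model({"features": torch.randn(2, 8), "offset": torch.zeros(2, 4)})
```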
- 06 Mar, 2021 1 commit
Myle Ott authored
- 04 Mar, 2021 1 commit
Sam Shleifer authored
- 02 Mar, 2021 2 commits
Myle Ott authored
Sean Naren authored
This adds a context manager that assists in making child modules with similar defaults. Usage:

```python
import torch
from fairscale.nn.misc import enable_wrap, wrap

with enable_wrap(**handful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    # Override parameters if you'd like
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)

# without the context manager, wrap() just creates the Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))
```

If not within the FSDP context, this would be a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.
- 01 Mar, 2021 1 commit
Sean Naren authored
- 27 Feb, 2021 1 commit
Min Xu authored
* [fix] FSDP corner case where all params are in the children
* lint
* fix
* tradeoff
* fix doc build
* review comments
- 26 Feb, 2021 2 commits
- 25 Feb, 2021 1 commit
Myle Ott authored
- 24 Feb, 2021 1 commit
Myle Ott authored
- 23 Feb, 2021 2 commits
Min Xu authored
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.

Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
  * all-gather parameters at the start of the forward pass and at the start of the backward pass
  * reduce-scatter grads at the end of the backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
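As a rough illustration of the drop-in usage described above, here is a minimal sketch. The model, dimensions, and training step are placeholders, and it assumes `torch.distributed` has already been initialized with one GPU per process.

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Placeholder model; assumes torch.distributed.init_process_group(...) was
# already called and a CUDA device is selected for this rank.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8)).cuda()

# reshard_after_forward=True trades extra communication for lower peak
# memory (ZeRO-3-like); False keeps DDP-level communication (ZeRO-2-like).
fsdp_model = FSDP(model, reshard_after_forward=True)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)
loss = fsdp_model(torch.randn(4, 32).cuda()).sum()
loss.backward()
optimizer.step()
```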