- 04 Mar, 2021 1 commit
-
Benjamin Lefaudeux authored
-
- 03 Mar, 2021 3 commits
- 02 Mar, 2021 2 commits
-
Myle Ott authored
-
Sean Naren authored
This adds a context manager that assists in creating child modules with shared defaults. Usage:
```
from fairscale.nn.misc import enable_wrap, wrap

with enable_wrap(**handful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    # Override parameters if you'd like
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)

# without the context manager, wrap just creates the plain Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))
```
Outside of the FSDP context, `wrap` is a no-op. This makes it easier to annotate layers without having to copy parameter changes from place to place.
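For intuition, here is a minimal sketch of how such a context manager can be implemented. This is an illustration under assumptions, not fairscale's actual implementation; in particular, using FSDP as the wrapper class and a module-level default store are assumptions for the sketch:
```
import contextlib

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

_wrap_defaults = None  # populated only while inside enable_wrap


@contextlib.contextmanager
def enable_wrap(**defaults):
    global _wrap_defaults
    _wrap_defaults = defaults
    try:
        yield
    finally:
        _wrap_defaults = None


def wrap(module, **overrides):
    if _wrap_defaults is None:
        return module  # no-op outside the context
    # per-call overrides win over the context defaults
    return FSDP(module, **{**_wrap_defaults, **overrides})
```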
-
- 01 Mar, 2021 3 commits
-
Min Xu authored
* [chores]: CI py39 on GPU and more efficiency
* add test list files
* fix
* add test list files
* split benchmark run into 2 runs
* fix 1.8 version and balance benchmarks
* fix
* fix
* fix
* fix
* recording tests
* py39 install fix
* test again
* move tests
* reorg tests
* skip tests for torch 1.8 due to an upstream bug
* removed __init__.py from tests since it confuses pytest
* Revert "removed __init__.py from tests since it confuses pytest" (this reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0)
* don't include __init__ in file list
* notes on __init__.py and added missing ones
* fixed mypy in a test file
* balance test runtime
* better pip install
* balance more
* pip fix
* balance
* balance more, all tests should finish within 20m now
* minor license update
* trying cu102
* more docs and addressed Ben's comments
* debugging
* debugging...
-
Min Xu authored
* [test] FSDP: add the failing test for #421
* skip on 1.5
* better skipping
* Update tests/nn/data_parallel/test_fsdp_grad_scaler.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
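For context, a hedged sketch of the pattern this test exercises: mixed-precision gradient scaling with FSDP uses fairscale's `ShardedGradScaler` rather than `torch.cuda.amp.GradScaler`, since gradients are sharded across ranks. `model`, `inputs`, and `optimizer` are assumed to exist (an FSDP-wrapped module, a CUDA batch, and its optimizer), and the process group is assumed initialized:
```
import torch
from fairscale.optim.grad_scaler import ShardedGradScaler

scaler = ShardedGradScaler()  # shard-aware replacement for GradScaler
with torch.cuda.amp.autocast():
    loss = model(inputs).sum()
scaler.scale(loss).backward()  # FSDP reduce-scatters the scaled grads
scaler.step(optimizer)         # unscales the local shard, then steps
scaler.update()
```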
-
Sean Naren authored
-
- 27 Feb, 2021 3 commits
-
vfdev authored
-
Min Xu authored
* [fix] FSDP corner case where all params are in the children
* lint
* fix
* tradeoff
* fix doc build
* review comments
-
Vittorio Caggiano authored
-
- 26 Feb, 2021 7 commits
-
Myle Ott authored
-
Min Xu authored
-
Myle Ott authored
-
Vittorio Caggiano authored
* Update README.md
-
Min Xu authored
-
Min Xu authored
* [feat]: add summon_full_params context mgr
* fix
* fix
* addressed comments
* fixed the state_dict copy
* lint
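A hedged usage sketch of what the new context manager enables; the exact call details are assumed from the commit title, and an initialized process group with one GPU per process is assumed:
```
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(torch.nn.Linear(5, 5).cuda())
with model.summon_full_params():
    # inside the block each rank temporarily sees the full, unsharded
    # parameters, e.g. to clone a consolidated state_dict
    full_state = {k: v.clone() for k, v in model.state_dict().items()}
# on exit, the parameters are sharded again
```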
-
anj-s authored
* clean start
* removing per layer split strategy, probably not that useful indeed
* initial transformer benchmark
* hack, enable testing ViT + offload: `python3 benchmarks/oss.py --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224`
* proper cuda streams and device, something off in terms of mem consumption
* minor, stashing
* unit test fix
* removing all the distributed parts
* simpler test, needs debugging
* working OOP, running a model which does not fit in GPU memory
* spring cleaning
* removing the ill-advised optimizer bits, better keep that orthogonal
* [offload] Add support for activation offloading + other changes (#367)
* initial fwd/bwd commit
* checkpoint work
* modify shard loop
* activation offloading and test to start with
* fix lint errors
* update comments
* fix lint
* remove unused var
* remove commented out lines
* modify name
* remove break
* remove profiler comments
* avoid saving inputs
* fix lint errors
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* [offload] Add support for fp16 training (#374)
* initial fwd/bwd commit
* checkpoint work
* modify shard loop
* activation offloading and test to start with
* fix lint errors
* update comments
* fix lint
* remove unused var
* remove commented out lines
* modify name
* remove break
* remove profiler comments
* add support for fp16
* add unit tests
* fix lint errors
* fix test failure
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* [offload] Add support for activation checkpointing for all layers. (#381)
* initial fwd/bwd commit
* checkpoint work
* modify shard loop
* activation offloading and test to start with
* fix lint errors
* update comments
* fix lint
* remove unused var
* remove commented out lines
* modify name
* remove break
* remove profiler comments
* add support for fp16
* add unit tests
* fix lint errors
* fix test failure
* cp work, incorrect output dimensions still need to be fixed
* fixed activation outputs
* intermediate cp of work
* add tests
* fix lint errors
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* add support for microbatches
* revert benchmark config changes
* add parametrization
* fix lint errors and tests
* skip test for 1.5
* fix lint errors
* skip test if there are no GPUs
* fix lint errors
* fix lint errors
* move experimental to the fairscale repo
* lint error fixes
* modify test imports
* lint error fixes
* move offload files to the experimental directory
* move tests and benchmarks to their folder
* fix mypy errors
* cp intermediate working benchmarks
* more changes
* split benchmark configs
* remove print statements
* fix lint errors
* remove unused print
* stress testing
* remove unused file
* change param name
* lint fixes
* move file to the right folder
* offload_experimental
* add doc string
* add error message

Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
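A hedged sketch of the experimental offload API this series adds. The class and parameter names below are assumptions inferred from the commit messages; the idea is that the model is split into slices that live on an offload device (CPU) and are moved to the compute device (GPU) one at a time:
```
import torch
from fairscale.experimental.nn.offload import OffloadModel

model = torch.nn.Sequential(
    torch.nn.Linear(5, 5), torch.nn.ReLU(), torch.nn.Linear(5, 5)
)
offload = OffloadModel(
    model=model,
    device=torch.device("cuda"),         # where compute happens
    offload_device=torch.device("cpu"),  # where idle shards live
    num_slices=3,                        # how many shards to split the model into
    checkpoint_activation=True,          # also checkpoint/offload activations
    num_microbatches=1,                  # microbatch support from this series
)
loss = offload(torch.rand(8, 5).cuda()).sum()
loss.backward()
```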
-
- 25 Feb, 2021 3 commits
-
Benjamin Lefaudeux authored
* bring back a fix from FSDP, may help a few existing users
-
Myle Ott authored
-
Min Xu authored
-
- 24 Feb, 2021 4 commits
-
anj-s authored
* refactor experimental file locations
* refactor fix
* disable test temporarily
* lint error fix
* make the change in the right file
* fix lint errors
* skip failing tests

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
Myle Ott authored
-
anj-s authored
-
Min Xu authored
* use weakref in the wrapper
* comment
* comment
* Update fairscale/nn/misc/checkpoint_activations.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
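A minimal sketch of the weakref pattern referenced above (an illustration, not the actual checkpoint_activations code): the wrapper holds the wrapped module through a weak reference, so the wrapper neither keeps the module alive nor creates a reference cycle:
```
import weakref


class WeakWrapper:
    """Sketch: hold the wrapped object via a weakref so this wrapper
    neither keeps it alive nor creates a reference cycle."""

    def __init__(self, module):
        self._module_ref = weakref.ref(module)  # weak, not a strong reference

    @property
    def module(self):
        m = self._module_ref()
        assert m is not None, "wrapped module was garbage collected"
        return m
```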
-
- 23 Feb, 2021 11 commits
-
Min Xu authored
* [test]: add peak mem in checkpoint test
* more debugging
* new test
* more fix
* better collection of debug info in case of future failures
* update the comment
* typo
* comment
* clarify
* better wording
-
Benjamin Lefaudeux authored
* v0.3.0 it is, celebration time
-
anj-s authored
* move experimental to the fairscale repo
* lint error fixes
* modify test imports
* lint error fixes
* lint errors

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
Benjamin Lefaudeux authored
-
Benjamin Lefaudeux authored
* POC, testing against the DDP comm hook when available
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
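For reference, a sketch of how the DDP compress hook being compared against is registered; the hook API only exists in recent PyTorch releases, hence "when available", and `model` is an assumed existing module:
```
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

ddp_model = torch.nn.parallel.DistributedDataParallel(model)
# compress gradients to fp16 before all-reduce;
# state=None uses the default process group
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```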
-
Myle Ott authored
-
Min Xu authored
-
Min Xu authored
-
Min Xu authored
* [bug]: not all CUDA memory is freed when model is deleted
* fixed memory leak - without this, peak memory will be high when more than one model is trained (i.e. the first model leaves stuff around, pushing up the peak memory when the second model runs)
* addressed comments
* fix
* changelog
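A small sketch of how this kind of leak shows up (a hypothetical repro, not the actual test): the peak memory measured while the second model trains includes whatever the first model left behind:
```
import torch

torch.cuda.reset_peak_memory_stats()
model_a = torch.nn.Linear(4096, 4096).cuda()
del model_a                  # should release all of model_a's CUDA memory
torch.cuda.empty_cache()
model_b = torch.nn.Linear(4096, 4096).cuda()
model_b(torch.rand(8, 4096).cuda()).sum().backward()
# inflated if model_a left anything behind
print(torch.cuda.max_memory_allocated())
```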
-
Min Xu authored
-
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.

Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
  * all-gather parameters at the start of the forward pass and the start of the backward pass
  * reduce-scatter grads at the end of the backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
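A minimal usage sketch, assuming `torch.distributed` is already initialized with one GPU per process; the constructor argument mirrors the description above:
```
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# reshard_after_forward=True trades ~50% extra communication for
# ZeRO-3-like memory savings; False is ZeRO-2-like at DDP-equivalent cost
model = FSDP(torch.nn.Linear(5, 5).cuda(), reshard_after_forward=True)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)  # build after wrapping

loss = model(torch.rand(8, 5).cuda()).sum()
loss.backward()  # grads are reduce-scattered and left sharded
optim.step()     # each rank updates only its own shard
```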
-
- 22 Feb, 2021 1 commit
-
Benjamin Lefaudeux authored
* adding an assert + corresponding unit test
* updated changelog
* adjusting the adascale tests
-
- 19 Feb, 2021 2 commits
-
Benjamin Lefaudeux authored
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
-
Min Xu authored
* [docs]: add checkpoint_wrapper and many small fixes
* update copyright year
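A hedged usage sketch of the wrapper being documented; the import path is taken from the related commits above, and everything else is an illustrative assumption:
```
import torch
from fairscale.nn.misc.checkpoint_activations import checkpoint_wrapper

# activations inside the wrapped block are recomputed during the backward
# pass instead of being stored, trading compute for memory
block = checkpoint_wrapper(
    torch.nn.Sequential(
        torch.nn.Linear(5, 5), torch.nn.ReLU(), torch.nn.Linear(5, 5)
    )
)
out = block(torch.rand(8, 5, requires_grad=True)).sum()
out.backward()
```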
-