- 12 May, 2021 1 commit
-
-
anj-s authored
* rename files
* add newly renamed file
* rename and move checkpoint activations related files
* add test files to CI list
* fix lint errors
* modify docs
* add changelog
* retain old path for now
* fix lint errors
* add another import test case
* fix merge conflict
* add missing test file
-
- 11 May, 2021 1 commit
-
-
Min Xu authored
* [fix] FSDP forward pass overlap between compute and all-gather - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it - a new unit test is added to check for this and to ensure we detect issues with overlap and CPU/GPU blocking wait calls
* fix
* fix
* fix
* better assertion outputs
* fix format and tune all_gather MB for CI
* more tuning with non_flatten
* undo an accidental change
* tuning all_gather MB and del model
* Update + fix overlapping test to use patched all_gather w/ delay (#672)
* fixing get_cycles_per_ms
* add get_smi_memory
* update the docstring

Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Myle Ott <myleott@fb.com>
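As a rough illustration of the patched-all_gather idea mentioned in the bullets above, here is a minimal sketch (not the fairscale test itself; the helper names, constant, and delay are illustrative) of injecting a GPU-side delay into all_gather so that blocking waits show up as a large slowdown while true compute/all-gather overlap barely changes step time:

```python
import time
from unittest import mock

import torch
import torch.distributed as dist

_orig_all_gather = dist.all_gather
CYCLES_PER_MS = 1_000_000  # placeholder; fairscale's get_cycles_per_ms() measures this

def delayed_all_gather(*args, **kwargs):
    # Spin the GPU for ~5 ms without blocking the CPU thread
    # (torch.cuda._sleep is a private PyTorch helper used in its own tests).
    torch.cuda._sleep(5 * CYCLES_PER_MS)
    return _orig_all_gather(*args, **kwargs)

def timed_forward(fsdp_model, batch):
    """Time one forward pass with the delayed all_gather patched in."""
    with mock.patch.object(dist, "all_gather", delayed_all_gather):
        torch.cuda.synchronize()
        start = time.time()
        out = fsdp_model(batch)
        torch.cuda.synchronize()
    return out, time.time() - start
```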
-
- 07 May, 2021 1 commit
-
-
Min Xu authored
* [test]: add a more general test case - also rebalance the tests a bit
* added missing arg
* balance
* better checking
* balance
* make test smaller and faster
* make DDP results cached and enable sync_bn
* clean up
* fix tests
* changelog
* balance
* fix
* addressing comments

Co-authored-by: Min Xu <min.xu@acm.org>
-
- 05 May, 2021 2 commits
-
-
Min Xu authored
* [fix] better assert and better test for frozen weights - the precise condition should have been to check m.parameters(), not m.params - fixes #643
* add changelog
* using an enum is so much better

Co-authored-by: Min Xu <min.xu@acm.org>
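For context, a minimal sketch of the frozen-weight setup this assert targets (the model and the choice of frozen layer are illustrative, and a process group must already be initialized):

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))
for p in model[0].parameters():   # freeze only the first layer
    p.requires_grad = False

# The check has to look at module.parameters() (all parameters, frozen or
# not), not just the flattened trainable params, to validate this setup.
fsdp_model = FSDP(model)
```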
-
Min Xu authored
* [fix] add clear_autocast_cache flag - when training in AMP mode with fp32 weights, FSDP may need to optionally clear the autocast cache to avoid GPU OOM - this flag defaults to False; doing it automatically is a future TODO - also added a verbose flag to make print(fsdp_model) a bit shorter - updated the memory test to cover the new code - added a couple of useful functions in parallel.py and testing.py
* minor
* address comments
* format
* improve the test

Co-authored-by: Min Xu <min.xu@acm.org>
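A minimal usage sketch of the two flags mentioned above, assuming they are exposed as FSDP constructor arguments in this version (the wrapped module and batch are illustrative):

```python
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

fsdp_model = FSDP(
    my_module,                  # hypothetical module; process group already initialized
    mixed_precision=False,      # keep fp32 weights, let autocast handle the casts
    clear_autocast_cache=True,  # drop autocast's cached low-precision weight copies to avoid OOM
    verbose=True,               # shorter print(fsdp_model) output
)

with torch.cuda.amp.autocast():
    loss = fsdp_model(batch).sum()   # hypothetical batch
loss.backward()
```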
-
- 03 May, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* fix + unit test * changelog update
-
- 28 Apr, 2021 2 commits
-
-
msbaines authored
-
Min Xu authored
* [feat] save memory by using the bucket buffer only in backward - this fixes bug #627 - added documentation to clarify the buffer's cost and the speed/memory tradeoff - added setup/teardown calls so that the buffer is only allocated during the backward pass, freeing that memory during forward and the optimizer step for things like activations - added a unit test that asserts the memory is in range. Compared with DDP: 1. the buffer size scales with the number of FSDP instances, not the model size; 2. the buffer is only allocated during backward; 3. the buffer is used for small tensors only, to reduce overhead; 4. the overlapping of compute and reduction is very different
* add PR number to changelog
* filled in with memory numbers on 1.9
* addressed comments
* update comments
* fix for 1.6
* add a todo

Co-authored-by: Min Xu <min.xu@acm.org>
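A minimal sketch of the speed/memory knob described above; the bucket_cap_mb argument name is assumed from fairscale's FSDP API and may differ by version:

```python
from fairscale.nn import FullyShardedDataParallel as FSDP

# Small per-parameter gradients are coalesced into a bucket buffer that now
# only lives during the backward pass. Setting the cap to 0 disables bucketing
# entirely, trading reduction speed for the smallest memory footprint.
fsdp_model = FSDP(my_module, bucket_cap_mb=25)   # my_module is hypothetical
```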
-
- 26 Apr, 2021 1 commit
-
-
Min Xu authored
* [chore] 0.3.6 release
* try redoing the caches

Co-authored-by: Min Xu <min.xu@acm.org>
-
- 19 Apr, 2021 1 commit
-
-
Min Xu authored
* [chore] 0.3.5 release
* address comment

Co-authored-by: Min Xu <min.xu@acm.org>
-
- 13 Apr, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 02 Apr, 2021 1 commit
-
-
Min Xu authored
- releasing 0.3.3 - I need it in vissl for the auto_wrap_bn change
-
- 18 Mar, 2021 3 commits
-
-
Min Xu authored
-
Min Xu authored
* [feat] FSDP: add auto_wrap_bn - add a utility function to handle wrapping of BN
* changelog
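A minimal usage sketch of auto_wrap_bn; the import path and the exact wrapping behavior are assumptions here and may differ between fairscale versions:

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn import auto_wrap_bn   # assumed export location; check your version

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
# Put each BatchNorm layer in its own FSDP instance (kept unflattened and in
# full precision so its statistics stay correct), then wrap the whole model.
model = auto_wrap_bn(model)
fsdp_model = FSDP(model)   # requires an initialized process group
```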
-
Min Xu authored
* [feature] FSDP: enable PyTorch SyncBN - not fully validated yet, but at least not asserting - this enables VISSL to move forward with its next PR
* add the test file
* changelog and lint
* addressed comment
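A minimal sketch of the combination this enables, using PyTorch's standard SyncBN conversion (the model and wrapping order are illustrative):

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # BN -> SyncBN
fsdp_model = FSDP(model)   # requires an initialized process group
```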
-
- 12 Mar, 2021 1 commit
-
-
Min Xu authored
* FSDP: multi-pass autograd graph and mixed precision - added BACKWARD_PRE/POST checking - better assert_state - fixed an issue of the backward hook misfiring
* fix
* cleanup
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Co-authored-by: Myle Ott <myleott@fb.com>
-
- 11 Mar, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* Adding a hard sync barrier before the broadcast; mostly useful for Gloo, since NCCL is synced behind the scenes
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510
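A minimal sketch of the pattern described above (not the fairscale code; the tensor and source rank are illustrative):

```python
import torch
import torch.distributed as dist

def synced_broadcast(tensor: torch.Tensor, src_rank: int = 0) -> None:
    # On Gloo, an explicit barrier keeps ranks from racing ahead of the
    # broadcast; NCCL already serializes on the collective itself.
    dist.barrier()
    dist.broadcast(tensor, src=src_rank)
```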
-
- 09 Mar, 2021 1 commit
-
-
Min Xu authored
* [chore] 0.3.1 release - mainly because vissl needs the new version - added a doc on release steps
* Update CHANGELOG.md
* review comments

Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
-
- 25 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* bring back a fix from FSDP, may help a few existing users
-
- 23 Feb, 2021 6 commits
-
-
Benjamin Lefaudeux authored
* v0.3.0 it is, celebration time
-
Benjamin Lefaudeux authored
* POC, testing against the DDP comm hook when available
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
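For reference, a minimal sketch of the upstream DDP compression hook mentioned above (available in recent PyTorch versions; the model here is illustrative):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

ddp_model = DDP(nn.Linear(16, 16).cuda(), device_ids=[0])
# Compress gradients to fp16 before all-reduce, then decompress on receipt.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```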
-
Min Xu authored
-
Min Xu authored
-
Min Xu authored
* [bug]: not all CUDA memory is freed when the model is deleted
* fixed memory leak - without this, peak memory will be high when more than one model is trained (i.e. the first model leaves stuff around, pushing up the peak memory when the second model runs)
* addressed comments
* fix
* changelog
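A minimal sketch of the scenario behind this fix (illustrative only; build_and_train and the tolerance are hypothetical, not the fairscale test):

```python
import gc
import torch

baseline = torch.cuda.memory_allocated()
model_a = build_and_train()            # hypothetical helper that trains one model
del model_a
gc.collect()
torch.cuda.empty_cache()

slack = 1 << 20                        # 1 MiB tolerance, arbitrary
assert torch.cuda.memory_allocated() <= baseline + slack, "CUDA memory leaked"
```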
-
Min Xu authored
-
- 22 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* adding an assert + corresponding unit test * updated changelog * adjusting the adascale tests
-
- 19 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
-
- 18 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* [fix] ShardedDDP train/eval modes * Update CHANGELOG.md
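A minimal sketch of the train/eval switching this fix addresses (the model, data, and hyperparameters are illustrative; a process group must already be initialized):

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim import OSS

model = nn.Linear(16, 4).cuda()
optimizer = OSS(model.parameters(), lr=0.1)   # wraps SGD by default
ddp = ShardedDDP(model, optimizer)

ddp.train()
ddp(torch.randn(8, 16).cuda()).sum().backward()
optimizer.step()

ddp.eval()                                    # must behave like a plain module in eval
with torch.no_grad():
    _ = ddp(torch.randn(8, 16).cuda())
```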
-
- 17 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* initial implementation, with unit test and assert * added changelog and better debug string
-
- 12 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed; make it optional so that you can save some time
* Enabling accumulation tests
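A minimal sketch of that refresh, assuming the entry point is ShardedDDP's refresh_trainable() method (the method name and the freezing example are assumptions; check your fairscale version):

```python
import torch.nn as nn
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim import OSS

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4)).cuda()
optimizer = OSS(model.parameters(), lr=0.1)
ddp = ShardedDDP(model, optimizer)

# Later, the set of trainable parameters changes...
for p in model[0].parameters():
    p.requires_grad = False

# ...so ask ShardedDDP to rebuild its bucketing/trainability assumptions.
# The call is optional because it costs a little time.
ddp.refresh_trainable()
```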
-
- 11 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* v0.1.6
-
- 03 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 02 Feb, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* adding a test to prove the interoperability with upstream PyTorch
* updating the changelog
* eager state pruning
* PyTorch 1.5 compat
-
- 29 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 07 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* trying to fix the missing files in the pip package (not in this diff)
* adding a long description, more PyPI-friendly
-
- 05 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
release pip package to follow suit
-
- 04 Jan, 2021 2 commits
-
-
Benjamin Lefaudeux authored
-
Min Xu authored
* [feat] sync AdaScale from internal repo - testing: TBD
* Update argument documentation of __init__
* update documentation around set_num_gradients_to_accumulate
* added checking code for proper API calling places
* rename internal APIs to make them internal
* updated changelog
* added support for add_param_group and its unit test
* added unit test for set_num_gradients_to_accumulate
* added debias_ewma unit test
* fixed test_set_num_gradients_to_accumulate (needs a zero_grad() call)
* added missing zero_grad() to test_lr_scheduler
* fixed test_add_param_group with respect to optim.zero_grad()
* added test_gradient_value
* added test_scale_not_equal_default for scale != world_size
* grad_accum
* added test_unhook()
* removed print statements
* fixed a typo
* addressed Ben's comment
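A minimal usage sketch of AdaScale with gradient accumulation, covering the set_num_gradients_to_accumulate and zero_grad interplay named above (the model, data, and learning rate are illustrative; torch.distributed is assumed to be initialized):

```python
import torch
from fairscale.optim import AdaScale

model = torch.nn.Linear(16, 4)
optim = AdaScale(torch.optim.SGD(model.parameters(), lr=0.1))
optim.set_num_gradients_to_accumulate(2)   # two micro-batches per optimizer step

for i, batch in enumerate(batches):        # batches is a hypothetical iterable
    model(batch).sum().backward()
    if (i + 1) % 2 == 0:
        optim.step()
        optim.zero_grad()                  # required between accumulation windows
```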
-
- 30 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* removing a dead call since ShardedDDP, small speedup
* unrelated, but filling in the changelog
* another nit
-
- 24 Dec, 2020 1 commit
-
-
Min Xu authored
* Update changelog: missed this item from the previous AdaScale commit
* More changelog
* Addressed review comments
-