- 29 Apr, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* Improving test coverage on SDP * using pytest exception catcher
-
- 28 Apr, 2021 3 commits
-
-
Min Xu authored
* [test] improve BN test coverage
  - Added sync_bn on/off cases
  - Added conv and linear bias on/off cases
  - Clarified with the test when BN wrapping is needed while sync_bn is off
* adding a comment
Co-authored-by: Min Xu <min.xu@acm.org>
-
Mehdi Mirzazadeh authored
* adding auto graph generation for distributed pipeline
* ignore trace.py for mypy for now, since it needs pytorch 1.8
* fixing tests
* simplifying graph api
* remove unused debug utilities
* use inspect to find argument lists
* use sharded linear layer
* flake8
* comment
* polishing
* polishing
-
Min Xu authored
* [feat] save memory by using bucket buffer only in backward
  - this fixes bug #627
  - added documentation to clarify the buffer's cost and speed/memory tradeoff
  - added setup/teardown calls so that the buffer is only allocated during the backward pass, saving more memory for forward and stepping so that it can be used for things like activations.
  - added a unit test that asserts the memory is in range.
  Comparing with DDP:
  1. buffer size scales with # of FSDP not model size
  2. buffer is only allocated during backward
  3. buffer is used for small tensors only to reduce overhead
  4. overlapping of compute-reduction is very different
* add PR number to changelog
* filled in with memory number on 1.9
* addressed comments
* update comments
* fix for 1.6
* add a todo
Co-authored-by: Min Xu <min.xu@acm.org>
-
- 26 Apr, 2021 1 commit
-
-
Min Xu authored
* [fix]: let FSDP handle model with multiple forward passes and checkpoint
* try CI again
* save
* save
* fixed case with bn
* minor
* add the new file
* minor
* added test of a single case, runtime is about 50s
* enable all 8 test cases
* cleanup
* cleanup
* skip flatten case with 1.6 and 1.7
* minor
Co-authored-by: Min Xu <min.xu@acm.org>
-
- 23 Apr, 2021 1 commit
-
-
shuyingsunshine21 authored
* relax checking root condition
* formatting
* add unittest
* add unittest to ci test list
* isort for import of unittest
* format black .
* move test to list 1
* add skip no cuda
* black and isort
-
- 22 Apr, 2021 2 commits
-
-
Min Xu authored
* [fix] mypy and flaky test
  - CI didn't seem to catch this or maybe I merged incorrectly yesterday
  - this should fix the mypy error on master
  - also updated a test that seems to be flaky due to tcp port conflict
* another flaky test, hopefully more determinism helps
* CR
* skip 1.6
* fix
* minor
Co-authored-by: Min Xu <min.xu@acm.org>
-
Benjamin Lefaudeux authored
-
- 19 Apr, 2021 1 commit
-
-
Min Xu authored
* FSDP: fixing training with freezing weights
  - an assert is changed to catch this case correctly
  - unit test added (based on Quentin's test code) for this case, comparing DDP and FSDP
  fixes: #610
* added test file to list 1
* Use better and simpler code as suggested by Myle
* testing both methods of freezing as well
Co-authored-by: Min Xu <min.xu@acm.org>
-
- 15 Apr, 2021 1 commit
-
-
anj-s authored
[fix] Revert change that removed the option to run OffloadModel without activation checkpointing. (#608)
* revert change made
* add tests and revert sync shard changes
* add tests
* remove file checked in by error
* inline var
* fix lint errors
* add checkpoint activation
* fix mypy
* use a bigger model
* modify tests for now
* resolve conflicts
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
- 13 Apr, 2021 3 commits
-
-
Sam Shleifer authored
-
Mehdi Mirzazadeh authored
Replacing multi-process pipe implementation with a more flexible one. Initial implementation of proposal pytorch/pytorch#55256
-
Benjamin Lefaudeux authored
* Adding a unit test which checks for multiple FW passes on the same block * Adding an embedding table, but still no problem to show for it
-
- 08 Apr, 2021 1 commit
-
-
Sam Shleifer authored
-
- 07 Apr, 2021 2 commits
-
-
Benjamin Lefaudeux authored
* Properly handle .train() and .eval() modes * showing that the unit test works, now fixed * code review
-
Myle Ott authored
-
- 06 Apr, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 05 Apr, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* making APIs more private * linting
-
- 04 Apr, 2021 3 commits
-
-
Sam Shleifer authored
-
msbaines authored
This test is flaky for torch >= 1.8.0.
-
Benjamin Lefaudeux authored
-
- 02 Apr, 2021 1 commit
-
-
msbaines authored
NCCL all_to_all is now supported in PyTorch (since v1.8.0). Fixes: #548
-
- 01 Apr, 2021 1 commit
-
-
msbaines authored
-
- 31 Mar, 2021 4 commits
-
-
msbaines authored
-
anj-s authored
* renaming/adding error messages * address comments * address comments * add more comments * add more comments
-
Min Xu authored
[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556)
* [fix] disable single rank process group for auto_wrap_bn
  - beefed up unit test with regnet-like model
  - found that single-rank process group is causing problems
  - disabled it to enable convergence tests on the vissl side
  - use `raise e from None` to get a better assertion output in testing.py.
* [test] fix regnet test for ddp+mixed_precision
  - need AMP context in FSDP
  - work around difference between ddp & fsdp when bias=True
  - fixed a bug in input data generation that caused different ranks to have the same data with wrong iteration count.
  - added TODO for needing a better loss and grad_scaler, and reduced iters so there is no nan.
  - added (disabled) debugging code
* lint
* lint
* add scaler
* lint
* scaler
* add a real loss
* seeding in the ranks
* balance tests
* run AMP DDP==FSDP test only on cuda version 11 and up
* add relu inplace and comment
* make wrap_bn cover more cases in full precision mode
-
msbaines authored
-
- 30 Mar, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* survive the model being moved to device post-construction * make sure that a unit test would catch a regression
-
- 29 Mar, 2021 1 commit
-
-
msbaines authored
-
- 28 Mar, 2021 1 commit
-
-
msbaines authored
-
- 26 Mar, 2021 1 commit
-
-
Min Xu authored
- added DDP equivalency test - added rmf, state_dict_norm functions to testing utils - added more debugging output to objects_are_equal
-
- 25 Mar, 2021 2 commits
-
-
Benjamin Lefaudeux authored
* re-activating unit test * removing changes that slipped in
-
Sam Shleifer authored
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
-
- 22 Mar, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 20 Mar, 2021 1 commit
-
-
Myle Ott authored
* Add new test for weight init (fails) * Set FSDP.compute_device so summon_full_params works before module moves to CUDA * Override FSDP.apply to enable custom weight init
-
- 19 Mar, 2021 3 commits
-
-
Benjamin Lefaudeux authored
* param buckets * unifying the buckets
-
msbaines authored
-
msbaines authored
-
- 18 Mar, 2021 2 commits
-
-
Benjamin Lefaudeux authored
* extracting the buckets in a dedicated class, fixing the resize_ bug * adding a unit test * copyright
-
Benjamin Lefaudeux authored
* enabling disabled tests
-