- 24 Sep, 2022 2 commits
-
-
Min Xu authored
* tmp * test again * test again * add new test * clean up * add test file to the testlist * more comments * add changelog Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
Min Xu authored
Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 23 Sep, 2022 2 commits
-
-
Min Xu authored
- Fixes from Benjamin. Original commit msg: - Fixes #1041. I just had a minute or two, hoping that it's enough :) Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
Min Xu authored
* [fix] better handling non-flatten in FSDP - see the detailed comment about that backward firing case - also minor debugging help in FSDP - also minor fix in FPW's state dict * [feat] disallow reset_parameters by default * [feat] adding fsdp_instances API - useful in check wrapping by user code * [fix] one line fix but more than a day of debugging * fixed the case of loading combined check with empty fsdp instances * fixed another bug around state loading the root/nonroot module full param caching due to not resharding after forward * [feat] support .half and .float better * fixed a bug in gather optim state losses extra keys from the original state_dict * fixed a test failure in mixed precision * fixed another bug affecting no_sync grad acc * fixed a bug and a test in fsdp optim state * fixed another corner case * added a comment * skip ssd offload tests * skip fsdp one for ssd overload Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 13 Sep, 2022 1 commit
-
-
Min Xu authored
* [bug] fix optim state gather when there is empty FSDP instances * fixes an anssert and a test bug
-
- 08 Aug, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 12 Jun, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 26 May, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 02 May, 2022 1 commit
-
-
Paul Johnson authored
[FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974) * [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data * Get ssd_offload unit tests passing * [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on * Update changelog
-
- 26 Apr, 2022 1 commit
-
-
Min Xu authored
Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 06 Apr, 2022 1 commit
-
-
Paul Johnson authored
Improvements to ssd_offload to support pickling/unpickling SsdTensorHandle (and derived classes) (#964) Verified that FSDP wrapped models using ssd_offload checkpoint save and restore correctly
-
- 30 Mar, 2022 1 commit
-
-
Paul Johnson authored
This is no longer needed since isort's version is 5.10 Also fix black version to 22.3.0 to fix issue with click dependency. Update files that now fail with new version of black {a = 2 ** 4} -> {a = 2**4}
-
- 03 Mar, 2022 1 commit
-
-
Min Xu authored
* add an ignore file * [fix] FSDP: handle the lazy_init better - when state_dict and load_state_dict is called, let'em not change the lazy_init state. * changelog * longer timeout * Revert "longer timeout" This reverts commit 00cc145fe86210a0972a1e7ba4f37531b9e091eb. * testing * adding the failed test * fix the global to local id * formatting * more complete fix and test * minor fix for an assert * update changelog * remove an extra line * Update fairscale/nn/data_parallel/fsdp_optim_utils.py Co-authored-by:
anj-s <32556631+anj-s@users.noreply.github.com> * Update fairscale/nn/data_parallel/fsdp_optim_utils.py Co-authored-by:
anj-s <32556631+anj-s@users.noreply.github.com> * Update fairscale/nn/data_parallel/fsdp_optim_utils.py Co-authored-by:
anj-s <32556631+anj-s@users.noreply.github.com> * addressed review comments Co-authored-by:
Min Xu <min.xu.public@gmail.com> Co-authored-by:
anj-s <32556631+anj-s@users.noreply.github.com>
-
- 23 Feb, 2022 2 commits
- 22 Feb, 2022 1 commit
-
-
anj-s authored
* add benchmarks for fsdp * fix lint errors * clean up * clean up unused flags * add the benchmarks * remove unused args * fix lint errors * fix lint errors * update command line * add support for multiple devices * try full fp16 mode * try full fp16 mode * lint errors * merge main * lint errors * lint errors * lint error * update intersphinx mapping for numpy * update intersphinx mapping for numpy * skip test * added golden configs * use synthetic benchmarks * fix fn name * fix cuda device id * fix verify * lint fix
-
- 14 Feb, 2022 1 commit
-
-
Min Xu authored
* update pytest versions * [test] test related changes - upgrade to newer pytorch versions - added function to make test more deterministic on A100 and TF32 - fixed some tests so that they are correctly skipped on a single GPU system * more fixes * formatting overly long lines * format * better test without trigger a warning * fix an optim state bug with newer pytorch - adam optimizer seems to return "step" as a singleton tensor now in the nightly build - this fixes it assumeing non-tensor value can still be loaded back by the optimizer * improve oss.py - use min_loss for regression checking is a bit more reliable - also increased the num epochs from 10 to 12 * small oss.py fix * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 11 Feb, 2022 1 commit
-
-
Min Xu authored
* skipping one more test * formatting * minor fix and copyright header * comment Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 08 Feb, 2022 1 commit
-
-
anj-s authored
* update intersphinx mapping for numpy * update intersphinx mapping for numpy * update pytorch mapping and disable test
-
- 25 Jan, 2022 1 commit
-
-
Min Xu authored
* [fix] reduce unit test memory * set seed in CI * fix random seed function * giving up CI, //sigh
-
- 13 Jan, 2022 1 commit
-
-
tmarkstrum authored
* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py Co-authored-by:
Min Xu <24926999+min-xu-ai@users.noreply.github.com> * added changelog * fixed some commit. * added unit test to ensure the reduce_scatter process group size is correct in default cases. And fall back to default process grouop when the reduce_scatter process group has the wrong size. * throw an error instead of rolling back to use default process group for reduce_scatter_process_group * Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group" This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9. * added check for None to avoid unit test failure * fixed an error to avoid the unit tests failure Co-authored-by:
Min Xu <24926999+min-xu-ai@users.noreply.github.com>
-
- 07 Jan, 2022 1 commit
-
-
tmarkstrum authored
* enable reduce scatter overlap with other operations * fixed unit tests and added docstrings for the new parameters for fsdp * fixed more unit tests * fixed unit tests * avoided the pickle error on process_group_reduce_scatter * removed an unnecessary parameter in unit tests * remove unnecessary prints * fixed the docstring * skipped the test_offload unit test because this unit test failed in the main branch * removed the enable_reduce_scatter_overlap API parameter * added doc string for the defualt value of process_group_reduce_scatter parameter * fixed a syntax bug * fixed a bug which cause unitest failure * removed the all_gather in the ProcessGroupName enum * added more comment * changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter
-
- 06 Jan, 2022 1 commit
-
-
four4fish authored
* FullyShardedDataParallel: only return full state dict on rank 0 * Add flag and make rank 0 only optional * Add tests * Add docs * address comments * update comments * update torch nightly version * update torchvision number for torch nightly dependence * add changelog * Update CHANGELOG.md * Update CHANGELOG.md
-
- 05 Jan, 2022 1 commit
-
-
Paul Johnson authored
* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py. * Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor * Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper * Modifications from PR #887 review comments. * Update Changelog
-
- 06 Dec, 2021 1 commit
-
-
Freddy Snijder authored
Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881) * Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) * Styling fixes * Updated the test to be independent of the Huggingface transformers package * Added test for issue #876 * Small error message fix * Skip test when CUDA is not available * Fixed naming of model
-
- 18 Nov, 2021 1 commit
-
-
Min Xu authored
* [fix]: fix eval for shared weight FSDP * fixing optim state saving * add changelog * reformat with newer local isort * update test * avoid computing reference state unless we are testing training * added optim_state test * make mypy happy * move tests; maybe we need to CUDA memory related tests in the first of the lists Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 17 Nov, 2021 1 commit
-
-
anj-s authored
* fixed lint issues * remove unused print statements * add changelog entry * [skip ci] fix lint errors
-
- 15 Nov, 2021 1 commit
-
-
Anupam Bhatnagar authored
* first commit * sharded scaler hitting nan assertions * adding test for sharded grad scaler without cpu offload * ddp grad scaler and fsdp sharded grad scaler test failing * removing test_output * fix no cpu offload test * changing optimizer from OSS to SGD * all tests passing, code cleanup pending * code cleanup * fix pyproject.toml * removing .isort.cfg * running isort linter * resolving isort issues * resolving black linter issue * resolving mypy issues * fix import statement * fix mypy error * modifying import statement * adding pytorch version requirement * fixing pytest skip test decorator * apply version guard for ShardedGradScaler * removing test_fsdp_grad_scaler * increasing num_epochs for ShardedGradScaler so that updates are not skipped * adding support for torch 1.8 * minor edit * [skip ci] more torch 1.8 changes * parametrizing the tests * cleanup code with linters * [skip ci] update doc string * [skip ci] addressing some more comments
-
- 12 Nov, 2021 1 commit
-
-
Anupam Bhatnagar authored
* adding pre-commit files * applying pre-commit to all files * adding no-strict-optional argument to mypy in circle ci config * fix typo * updating python versions * [skip ci] remove extra args * adding python 3.9 * [skip ci] set pre-commit version in requirements-dev.txt * set CACHE_VERSION * move linters from circleci to github actions * update python version * update python version in benchmarks_2 * moving to python 3.9.7
-
- 09 Nov, 2021 1 commit
-
-
Anupam Bhatnagar authored
* CI config changes * changing params for failing tests * [skip ci] minor edit
-
- 08 Nov, 2021 1 commit
-
-
anj-s authored
* update release notes * initial commit * lint cleanup etc. * helper functions; lint errors * lint errors * lint errors * add back the boolean for named_parameters * address comments and fix lint * remove unused functions and class * remove unused state
-
- 05 Nov, 2021 1 commit
-
-
Min Xu authored
* [feat] MEVO kernel - initial import from min/softmax and min/testing branches - need to rename and further cleanup * only test with newer pytorch * renamed and added comments and code cleanup * rename and reduce test memory * testing * minor fixing * fixing * more fix * changelog * more 1.7 and 1.8 paper cuts * remove dead code * addressed Benjamin's comments * addressed more comments Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 01 Nov, 2021 1 commit
-
-
Min Xu authored
* added a new test, passing without shared weights * tested weight sharing * added the test to test list file * extended to world_size = 2 * fixed test * [feat]: add limited and experimental support for shared parameter * fixed tests * simplify to work with layer with at least 1 non-shared params and add code to pick up linked_param field for sharding the shared param * fixed the case where linked param is not in separate FSDP * changelog and remove old code Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 28 Oct, 2021 1 commit
-
-
Min Xu authored
* [fix] fix test on main * [fix] fix test on main Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 27 Oct, 2021 3 commits
-
-
anj-s authored
* remove offload dependency on fp16 * update python version for cpu tess * run CPU tests with updated PyTorch version * split changes * revert tests config * fix lint errors * update nightly and test PyTorch versions * skip failing multiprocess pipe test * always skip test * always skip test * always skip test * lint error * skip unsupported versions * improve skip message * lint errors * modify docs * add tests * fix test failures * modify comments * fix lint errors * fix lint errors
-
Min Xu authored
* checkpoint + nonflat + mixed_precision * make tests pass with expected errors * addressed comments * add a comment Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
Min Xu authored
* added the failing test * fixed the bug * fine-tune the condition * typo * typo * changelog and added test to test files Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 12 Sep, 2021 1 commit
-
-
Darryl Barnhart authored
* [fix] FSDP intra-backwards gradient accumulation. Ensure gradient reduction accumulates into the unsharded gradient tensor within a backwards pass. This matters when an FSDP module is called multiple times within a forward pass, and reduction is _not_ deferred using activation checkpoint forward counters, bucketing or some other mechanism. Closes #780 * [refactor] Remove forward counters. Comments. Removed forward counters from the activation checkpointing utility, now that FSDP does not require them for correct operation. Add more detailed comment about memory usage behaviour with gradient reduction. * [refactor] Delete deprecated forward counter usage. * [refactor] Add state assertion as end of pre-backward hook.
-
- 06 Sep, 2021 1 commit
-
-
Min Xu authored
[cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744) * changelog; mypy; oss cleanup * more broadcast_object cleanup in FSDP * one more mypy fix * retire pytorch 1.6 from circleci, add new lightly, add 1.8 LTS and 1.9 stable release * update torch version for LTS * minor fixes * update cache key * trying newer gpu VMs * bump the cache * update to gpu.medium, which should be 2 GPUs * update nightly version * add pre-commit instruction * fixed CHANGELOG after merging * updated to newer nightly * retained the older broadcast function for older GPUs for oss.py * fixed a bug * added a comment * fixing a test for pytorch 1.10 * testing a fix * Update fairscale/optim/oss.py * Update CONTRIBUTING.md Co-authored-by:Min Xu <min.xu.public@gmail.com>
-
- 12 Aug, 2021 1 commit
-
-
anj-s authored
* add additional assert for checking if the requires_grad field is set. * fix lint errors * add unit tests and address comments
-