- 05 Jul, 2022 1 commit
-
-
Riyasat Ohib authored
* [Fix] Restructure for wgit availability as a package
* Preliminary implementation of wgit status
* [Feat] Addition of wgit status
  1. Functionalities to check the status of the repo.
  2. Checks if a file has been modified, and whether changes have been added or added changes committed.
* [test] Addition of tests for weigit status
  1. Some minor refactors and docstring changes
* [Fix] Changes in repo status test
* [test] status test fix
  1. Made the test status printing order independent
* [refactor] Metadata dirs mirroring chkpt paths, changes in wgit status
  1. Metadata files are now created within wgit with directory structure mirroring the relative paths of the checkpoint/files they track.
  2. Changes in status: 3 statuses now.
  3. Changes in tests.
  4. Some code refactoring.
* [cleanup] minor changes in comments and cleanup
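The mirrored metadata layout described in the [refactor] item above can be pictured with a short sketch. This is not the wgit implementation: the function name `metadata_path` is hypothetical, and the `.wgit` directory name is taken from other commit messages in this log. It only shows how a metadata path can be derived from a tracked file's path relative to the repo root.

```python
from pathlib import Path


def metadata_path(repo_root: Path, tracked_file: Path, meta_dir: str = ".wgit") -> Path:
    """Return where the metadata file for `tracked_file` would live (hypothetical helper).

    Both paths are assumed to be expressed against the same base (for
    example, both absolute). Raises ValueError if `tracked_file` is not
    located inside `repo_root`.
    """
    # Path of the tracked file relative to the repo root, e.g. checkpoints/model.pt
    relative = tracked_file.relative_to(repo_root)
    # Mirror that relative path underneath the metadata directory.
    return repo_root / meta_dir / relative


if __name__ == "__main__":
    root = Path("/work/project")
    print(metadata_path(root, root / "checkpoints" / "epoch_3" / "model.pt"))
```

Mirroring the relative path keeps two tracked files with the same name in different directories from colliding in the metadata directory.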
-
- 29 Jun, 2022 1 commit
-
-
Min Xu authored
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 24 Jun, 2022 1 commit
-
-
Riyasat Ohib authored
weigit: Fixed file tracking with metadata. Changes in sha1_store for better encapsulation. Docstrings. (#1013)
* [Feat] Fixed file tracking with metadata. Change in sha1_store for better encapsulation. Tests.
  1. Adds metadata creation per added file and independently tracks the version of each separate file added. That is, it now creates separate metadata files for each file to be tracked.
  2. Changes in reference tracking to accommodate the change in 1.
  3. Some changes in SHA1_store for better encapsulation.
  4. Modified the tests to reflect the above.
* [Feat]
  1. Added docstrings to the classes.
  2. Added a recursive search for the weigit repo up to root.
  3. Some code refactoring.
* [Feat][Refactor] repo and sha1_store add modification and separation. Modification in reference tracking
  1. Separation of add functionalities of repo.add and sha1_store.add.
  2. Updated the reference tracking.
  3. New tests and code refactor
* [Fix] Sha1_store fix overlap in first two characters of sha1 hash.
  1. Accept multiple sha1 hashes with the same two starting characters and create directories accordingly.
* [Fix] Minor refactoring and test fix
* [Fix] Fix for pygit class initialization in cases when no .gitconfig file is available
Co-authored-by: Riyasat Ohib <riohib@devfair0756.h2.fair>
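The Sha1_store fix above concerns hashes that share their first two characters. A minimal sketch of one way such a store can be laid out follows. It assumes a git-like layout where the first two hex characters name a bucket directory; the commit message does not show the real directory structure, and `sha1_of_file` and `add_to_store` are hypothetical names rather than the sha1_store API.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_store(store_root: Path, file_path: Path) -> Path:
    """Copy `file_path` into the store and return the stored path (hypothetical helper).

    Assumed layout: <store_root>/<first two hex chars>/<full digest>.
    """
    digest = sha1_of_file(file_path)
    bucket = store_root / digest[:2]
    # exist_ok=True lets several hashes with the same prefix share one bucket.
    bucket.mkdir(parents=True, exist_ok=True)
    target = bucket / digest
    if not target.exists():
        shutil.copy2(file_path, target)
    return target
```

Using the full digest as the file name inside the bucket means two different hashes with the same prefix land in the same directory without overwriting each other.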
-
- 15 Jun, 2022 1 commit
-
-
Crutcher Dunnavant authored
* Fix CI
* ci pythonpath
-
- 14 Jun, 2022 1 commit
-
-
Riyasat Ohib authored
* [feat] Adds the implementation for the wgit add functionality, with sha1 hash creation, reference tracking, dependency graph creation and all related functionalities for the wgit add method.
* [feat] Adds the wgit add and wgit commit functionalities and major refactors.
  1. Adds the wgit add and wgit commit functionalities to the api.
  2. Introduces a new PyGit class that wraps the internal .wgit/.git repo.
  3. Refactors the Repo class in the api, and introduces some methods.
  4. Refactors all the classes, which no longer use @staticmethods and now use object instances instead.
  5. Moved much of the directory path handling code from os.path to the pathlib library.
* [Feat] Combines the Repo and Weigit classes. Separate classes into separate modules.
  1. Combines the functionalities of the WeiGit and Repo class into a single WeiGitRepo class.
  2. Classes are now separated into their own modules.
  3. Moved some functions and staticmethod to utils.
  4. Adds a range of tests for add and commit functionalities of weigit.
* [fix] adds a new test to the ci_test_list_3
* [fix] test fix
* [fix] test fix
* [Feat] Directory restructuring, type checking and some standardization
  1. Restructured the directory and moved wgit to fairscale/experimental/wgit so that it can be found as a package when pip installed.
  2. Added a range of type checking
  3. Some refactors
* [Feat][Refactor] Directory restructuring, test addition and type checking
  1. Restructured the test directory
  2. Added and modified a few wgit tests.
  3. Added some type checking to the code
* test fix
* "setup fix and repo checking added in cli"
* [Feat] Better initialization and error handling for init and wgit subcommands. Test reorg.
* [refactor] Changes in classes, encapsulation and addition of PyGit test.
* [Feat][Refactor]
  1. Changed some class method arguments for better encapsulation for Sha1_store.
  2. Moved sha1 hash calculation within sha1_store.
  3. Some standardization and code clean up of unnecessary snippets.
  4. Added new tests for the PyGit and Sha1_Store class.
-
- 12 Jun, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 01 Jun, 2022 1 commit
-
-
Riyasat Ohib authored
* [feat] Adding wgit within fairscale/experimental/wgit.
* [feat] adding experimental wgit
* [feat] wgit init functionalities and skeleton for the rest.
* adapted the suggested changes
* repo class working
* [feat] wgit functionalities and skeleton. Addition of subparsers and repo class along with some changes.
* [feat] wgit functionalities and skeleton, move to subparsers and addition of Repo Class
* [feat] wgit functionalities and skeleton, move to subparsers and addition of Repo Class
* [docs] changed a comment in .gitignore
* [refactor] changed the sequence of tests in ci_test_list2.txt
-
- 31 May, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 30 May, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 26 May, 2022 1 commit
-
-
Crutcher Dunnavant authored
-
- 25 May, 2022 1 commit
-
-
Riyasat Ohib authored
* [feat] Adding wgit within fairscale/experimental/wgit.
* [feat] adding experimental wgit
-
- 02 May, 2022 1 commit
-
-
Paul Johnson authored
[FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974)
* [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data
* Get ssd_offload unit tests passing
* [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on
* Update changelog
-
- 26 Apr, 2022 1 commit
-
-
Min Xu authored
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 06 Apr, 2022 1 commit
-
-
Paul Johnson authored
Improvements to ssd_offload to support pickling/unpickling SsdTensorHandle (and derived classes) (#964)
Verified that FSDP wrapped models using ssd_offload checkpoint save and restore correctly
-
- 30 Mar, 2022 1 commit
-
-
Paul Johnson authored
This is no longer needed since isort's version is 5.10. Also pin the black version to 22.3.0 to fix an issue with the click dependency. Update files that now fail with the new version of black: {a = 2 ** 4} -> {a = 2**4}
-
- 03 Mar, 2022 1 commit
-
-
Min Xu authored
* add an ignore file
* [fix] FSDP: handle the lazy_init better
  - when state_dict and load_state_dict are called, let them not change the lazy_init state.
* changelog
* longer timeout
* Revert "longer timeout"
  This reverts commit 00cc145fe86210a0972a1e7ba4f37531b9e091eb.
* testing
* adding the failed test
* fix the global to local id
* formatting
* more complete fix and test
* minor fix for an assert
* update changelog
* remove an extra line
* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
  Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
  Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
  Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
* addressed review comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
-
- 23 Feb, 2022 2 commits
- 22 Feb, 2022 1 commit
-
-
anj-s authored
* add benchmarks for fsdp
* fix lint errors
* clean up
* clean up unused flags
* add the benchmarks
* remove unused args
* fix lint errors
* fix lint errors
* update command line
* add support for multiple devices
* try full fp16 mode
* try full fp16 mode
* lint errors
* merge main
* lint errors
* lint errors
* lint error
* update intersphinx mapping for numpy
* update intersphinx mapping for numpy
* skip test
* added golden configs
* use synthetic benchmarks
* fix fn name
* fix cuda device id
* fix verify
* lint fix
-
- 15 Feb, 2022 1 commit
-
-
ruanslv authored
* [fix] Add option to wrap root module in auto_wrap
* Fix unit-test comment
* adding a few more tests to make expected behavior clear
* move changes to wrap policy as suggested
* set default to false
* revert pre-commit change
* revert pre-commit change 2
Co-authored-by: Ruan Silva <ruanrms@fb.com>
-
- 14 Feb, 2022 1 commit
-
-
Min Xu authored
* update pytest versions
* [test] test related changes
  - upgrade to newer pytorch versions
  - added function to make test more deterministic on A100 and TF32
  - fixed some tests so that they are correctly skipped on a single GPU system
* more fixes
* formatting overly long lines
* format
* better test without triggering a warning
* fix an optim state bug with newer pytorch
  - adam optimizer seems to return "step" as a singleton tensor now in the nightly build
  - this fixes it, assuming a non-tensor value can still be loaded back by the optimizer
* improve oss.py
  - using min_loss for regression checking is a bit more reliable
  - also increased the num epochs from 10 to 12
* small oss.py fix
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 11 Feb, 2022 1 commit
-
-
Min Xu authored
* skipping one more test
* formatting
* minor fix and copyright header
* comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 08 Feb, 2022 1 commit
-
-
anj-s authored
* update intersphinx mapping for numpy
* update intersphinx mapping for numpy
* update pytorch mapping and disable test
-
- 28 Jan, 2022 1 commit
-
-
Min Xu authored
* [feat] add CosFace paper's LMCL to MEVO
  - added baseline algorithm to the reference kernel
  - added MEVO version of LMCL
  - added unit test to verify it is correct with respect to the reference as well as its memory usage
* updated changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>
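For readers unfamiliar with LMCL (the large margin cosine loss from the CosFace paper), here is a plain-Python sketch of the loss for a single sample. It is neither the reference kernel nor the MEVO version mentioned above: the function name `lmcl_loss` is hypothetical and the default scale `s` and margin `m` are illustrative values only. The loss subtracts the margin from the target class cosine, scales all cosines, and applies softmax cross-entropy.

```python
import math
from typing import Sequence


def lmcl_loss(cosines: Sequence[float], target: int, s: float = 30.0, m: float = 0.35) -> float:
    """Large margin cosine loss for one sample (illustrative sketch).

    `cosines[j]` is the cosine similarity between the normalized feature
    and the normalized weight vector of class j.
    """
    # Subtract the margin only from the target class, then scale every logit.
    logits = [s * (c - m) if j == target else s * c for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Cross-entropy: -log softmax(logits)[target]
    return log_sum_exp - logits[target]
```

With `m = 0` this reduces to ordinary softmax cross-entropy on scaled cosines, so a positive margin always makes the loss for the same inputs larger.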
-
- 25 Jan, 2022 1 commit
-
-
Min Xu authored
* [fix] reduce unit test memory
* set seed in CI
* fix random seed function
* giving up CI, //sigh
-
- 14 Jan, 2022 1 commit
-
-
Anupam Bhatnagar authored
-
- 13 Jan, 2022 2 commits
-
-
Anupam Bhatnagar authored
* [skip ci] first commit
* [skip ci] gradient scaler example
* [skip ci] adding feed forward toy example
* [skip ci] adding types
* [skip ci] adding backward hook
* [skip ci] update
* [skip ci] working feed forward example
* [skip ci] working feed forward example
* [skip ci] use named_modules instead of named_children
* [skip ci] adding new file
* [skip ci] clean up
* [skip ci] implement unscale function
* [skip ci] implement unscale function
* [skip ci] removing old file
* [skip ci] removing some more old files
* [skip ci] making unscale function generic
* [skip ci] adding test for vision model
* [skip ci] adding identity layer
* [skip ci] cleanup files
* [skip ci] refactoring
* [skip ci] more refactoring
* [skip ci] added functionality to update scale
* [skip ci] data loader clean up
* [skip ci] implemented inf checks and update scale functions
* [skip ci] code clean up. added...
-
tmarkstrum authored
* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
  Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
* added changelog
* fixed some commit.
* added unit test to ensure the reduce_scatter process group size is correct in default cases. And fall back to default process group when the reduce_scatter process group has the wrong size.
* throw an error instead of rolling back to use default process group for reduce_scatter_process_group
* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"
  This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.
* added check for None to avoid unit test failure
* fixed an error to avoid the unit tests failure
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
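The padding fix above is not shown in the message itself. As general background, a reduce-scatter hands each rank one equally sized chunk, so the flattened input is usually padded up to a multiple of the group size. The sketch below illustrates that arithmetic on plain Python lists. The helper names are hypothetical and this is not the FSDP code.

```python
from typing import List


def pad_to_multiple(flat: List[float], world_size: int) -> List[float]:
    """Zero-pad `flat` so its length is a multiple of `world_size` (hypothetical helper)."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    # (-n) % k is the smallest non-negative amount that rounds n up to a multiple of k.
    pad = (-len(flat)) % world_size
    return flat + [0.0] * pad


def split_evenly(flat: List[float], world_size: int) -> List[List[float]]:
    """Return one equally sized chunk per rank, padding the input first."""
    padded = pad_to_multiple(flat, world_size)
    chunk = len(padded) // world_size
    return [padded[i * chunk:(i + 1) * chunk] for i in range(world_size)]
```

The padding amount depends on the size of the group that actually performs the reduce-scatter, which is one reason a process group of the wrong size can lead to mismatched chunk shapes.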
-
- 07 Jan, 2022 1 commit
-
-
tmarkstrum authored
* enable reduce scatter overlap with other operations
* fixed unit tests and added docstrings for the new parameters for fsdp
* fixed more unit tests
* fixed unit tests
* avoided the pickle error on process_group_reduce_scatter
* removed an unnecessary parameter in unit tests
* remove unnecessary prints
* fixed the docstring
* skipped the test_offload unit test because this unit test failed in the main branch
* removed the enable_reduce_scatter_overlap API parameter
* added doc string for the default value of process_group_reduce_scatter parameter
* fixed a syntax bug
* fixed a bug which caused unit test failure
* removed the all_gather in the ProcessGroupName enum
* added more comment
* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter
-
- 06 Jan, 2022 1 commit
-
-
four4fish authored
* FullyShardedDataParallel: only return full state dict on rank 0
* Add flag and make rank 0 only optional
* Add tests
* Add docs
* address comments
* update comments
* update torch nightly version
* update torchvision number for torch nightly dependence
* add changelog
* Update CHANGELOG.md
* Update CHANGELOG.md
-
- 05 Jan, 2022 1 commit
-
-
Paul Johnson authored
* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments.
* Update Changelog
-
- 13 Dec, 2021 1 commit
-
-
Min Xu authored
- During eval, we will fall back to just output projection without fusing - added unit test to ensure the shape is correct
-
- 06 Dec, 2021 1 commit
-
-
Freddy Snijder authored
Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881)
* Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876)
* Styling fixes
* Updated the test to be independent of the Huggingface transformers package
* Added test for issue #876
* Small error message fix
* Skip test when CUDA is not available
* Fixed naming of model
-
- 18 Nov, 2021 1 commit
-
-
Min Xu authored
* [fix]: fix eval for shared weight FSDP
* fixing optim state saving
* add changelog
* reformat with newer local isort
* update test
* avoid computing reference state unless we are testing training
* added optim_state test
* make mypy happy
* move tests; maybe we need to put CUDA memory related tests first in the lists
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 17 Nov, 2021 1 commit
-
-
anj-s authored
* fixed lint issues
* remove unused print statements
* add changelog entry
* [skip ci] fix lint errors
-
- 15 Nov, 2021 1 commit
-
-
Anupam Bhatnagar authored
* first commit
* sharded scaler hitting nan assertions
* adding test for sharded grad scaler without cpu offload
* ddp grad scaler and fsdp sharded grad scaler test failing
* removing test_output
* fix no cpu offload test
* changing optimizer from OSS to SGD
* all tests passing, code cleanup pending
* code cleanup
* fix pyproject.toml
* removing .isort.cfg
* running isort linter
* resolving isort issues
* resolving black linter issue
* resolving mypy issues
* fix import statement
* fix mypy error
* modifying import statement
* adding pytorch version requirement
* fixing pytest skip test decorator
* apply version guard for ShardedGradScaler
* removing test_fsdp_grad_scaler
* increasing num_epochs for ShardedGradScaler so that updates are not skipped
* adding support for torch 1.8
* minor edit
* [skip ci] more torch 1.8 changes
* parametrizing the tests
* cleanup code with linters
* [skip ci] update doc string
* [skip ci] addressing some more comments
-
- 12 Nov, 2021 1 commit
-
-
Anupam Bhatnagar authored
* adding pre-commit files
* applying pre-commit to all files
* adding no-strict-optional argument to mypy in circle ci config
* fix typo
* updating python versions
* [skip ci] remove extra args
* adding python 3.9
* [skip ci] set pre-commit version in requirements-dev.txt
* set CACHE_VERSION
* move linters from circleci to github actions
* update python version
* update python version in benchmarks_2
* moving to python 3.9.7
-
- 09 Nov, 2021 1 commit
-
-
Anupam Bhatnagar authored
* CI config changes
* changing params for failing tests
* [skip ci] minor edit
-
- 08 Nov, 2021 2 commits
-
-
anj-s authored
* update release notes
* initial commit
* lint cleanup etc.
* helper functions; lint errors
* lint errors
* lint errors
* add back the boolean for named_parameters
* address comments and fix lint
* remove unused functions and class
* remove unused state
-
Benjamin Lefaudeux authored
Add SlowMo Distributed Data Parallel for clusters with slow interconnects
Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>
-