- 31 May, 2022 1 commit
Crutcher Dunnavant authored
- 26 May, 2022 1 commit
Crutcher Dunnavant authored
- 30 Mar, 2022 1 commit
Paul Johnson authored
This is no longer needed since isort's version is 5.10. Also pin black to 22.3.0 to fix an issue with the click dependency. Update files that now fail with the new version of black: `a = 2 ** 4` -> `a = 2**4`
- 14 Feb, 2022 1 commit
Min Xu authored
* update pytest versions
* [test] test related changes
  - upgrade to newer pytorch versions
  - added a function to make tests more deterministic on A100 and TF32
  - fixed some tests so that they are correctly skipped on a single GPU system
* more fixes
* formatting overly long lines
* format
* better test without triggering a warning
* fix an optim state bug with newer pytorch
  - the adam optimizer seems to return "step" as a singleton tensor in the nightly build
  - this fixes it, assuming a non-tensor value can still be loaded back by the optimizer
* improve oss.py
  - using min_loss for regression checking is a bit more reliable
  - also increased the num epochs from 10 to 12
* small oss.py fix
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Co-authored-by: Min Xu <min.xu.public@gmail.com>
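The optim state fix above concerns newer PyTorch builds returning Adam's `step` as a singleton tensor. A minimal sketch of that kind of workaround (the helper name is hypothetical, not fairscale's actual code):

```python
import torch

def unwrap_step_tensors(optim_state: dict) -> dict:
    """Convert singleton-tensor 'step' entries back to plain numbers so the
    state dict can be loaded by optimizer code expecting non-tensor values."""
    for param_state in optim_state.get("state", {}).values():
        step = param_state.get("step")
        if isinstance(step, torch.Tensor):
            # nightly Adam stores 'step' as a 0-dim tensor; .item() recovers the scalar
            param_state["step"] = step.item()
    return optim_state
```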
- 11 Feb, 2022 1 commit
Min Xu authored
* skipping one more test
* formatting
* minor fix and copyright header
* comment

Co-authored-by: Min Xu <min.xu.public@gmail.com>
- 14 Jan, 2022 1 commit
Anupam Bhatnagar authored
- 13 Jan, 2022 1 commit
Anupam Bhatnagar authored
* [skip ci] first commit
* [skip ci] gradient scaler example
* [skip ci] adding feed forward toy example
* [skip ci] adding types
* [skip ci] adding backward hook
* [skip ci] update
* [skip ci] working feed forward example
* [skip ci] working feed forward example
* [skip ci] use named_modules instead of named_children
* [skip ci] adding new file
* [skip ci] clean up
* [skip ci] implement unscale function
* [skip ci] implement unscale function
* [skip ci] removing old file
* [skip ci] removing some more old files
* [skip ci] making unscale function generic
* [skip ci] adding test for vision model
* [skip ci] adding identity layer
* [skip ci] cleanup files
* [skip ci] refactoring
* [skip ci] more refactoring
* [skip ci] added functionality to update scale
* [skip ci] data loader clean up
* [skip ci] implemented inf checks and update scale functions
* [skip ci] code clean up. added...
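The unscale and inf-check steps these commits describe can be illustrated in plain PyTorch (a minimal sketch under assumed semantics, not the actual implementation; `unscale_and_check_inf` is a hypothetical helper):

```python
import torch

def unscale_and_check_inf(parameters, scale: float) -> bool:
    """Multiply grads by 1/scale in place and report whether any gradient
    contains inf/nan, so the caller can skip the step and update the scale."""
    inv_scale = 1.0 / scale
    found_inf = False
    for p in parameters:
        if p.grad is None:
            continue
        p.grad.mul_(inv_scale)
        if not torch.isfinite(p.grad).all():
            found_inf = True
    return found_inf
```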
- 12 Nov, 2021 1 commit
Anupam Bhatnagar authored
* adding pre-commit files
* applying pre-commit to all files
* adding no-strict-optional argument to mypy in circle ci config
* fix typo
* updating python versions
* [skip ci] remove extra args
* adding python 3.9
* [skip ci] set pre-commit version in requirements-dev.txt
* set CACHE_VERSION
* move linters from circleci to github actions
* update python version
* update python version in benchmarks_2
* moving to python 3.9.7
- 10 Sep, 2021 1 commit
Benjamin Lefaudeux authored
- 06 Sep, 2021 1 commit
Min Xu authored
[cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744)
* changelog; mypy; oss cleanup
* more broadcast_object cleanup in FSDP
* one more mypy fix
* retire pytorch 1.6 from circleci, add new nightly, add 1.8 LTS and 1.9 stable release
* update torch version for LTS
* minor fixes
* update cache key
* trying newer gpu VMs
* bump the cache
* update to gpu.medium, which should be 2 GPUs
* update nightly version
* add pre-commit instruction
* fixed CHANGELOG after merging
* updated to newer nightly
* retained the older broadcast function for older GPUs for oss.py
* fixed a bug
* added a comment
* fixing a test for pytorch 1.10
* testing a fix
* Update fairscale/optim/oss.py
* Update CONTRIBUTING.md

Co-authored-by: Min Xu <min.xu.public@gmail.com>
- 27 Jul, 2021 1 commit
Benjamin Lefaudeux authored
- 26 Jun, 2021 1 commit
Pavel Belevich authored
- 08 May, 2021 1 commit
anj-s authored
* rename and move optim/utils.py
* attach the new file
- 06 Apr, 2021 1 commit
Benjamin Lefaudeux authored
- 05 Apr, 2021 1 commit
Benjamin Lefaudeux authored
* making APIs more private
* linting
- 04 Apr, 2021 1 commit
Benjamin Lefaudeux authored
- 19 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* param buckets
* unifying the buckets
- 18 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* enabling disabled tests
- 17 Mar, 2021 1 commit
Benjamin Lefaudeux authored
- 15 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* extending the current state_dict interface, making it possible to do everything in a single call and to checkpoint on all ranks
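A sketch of the sharded-optimizer checkpointing flow this touches, using fairscale's OSS wrapper; the exact single-call interface added by the commit may differ, and the toy model and setup below are assumed:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.optim import SGD
from fairscale.optim.oss import OSS

# assumes a torch.distributed process group is already initialized
model = nn.Linear(16, 4).cuda()
optimizer = OSS(model.parameters(), optim=SGD, lr=0.1)

# gather the sharded optimizer state, then checkpoint it with a
# regular state_dict() call on the recipient rank
optimizer.consolidate_state_dict(recipient_rank=0)
if dist.get_rank() == 0:
    torch.save(optimizer.state_dict(), "optim_ckpt.pt")
```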
- 12 Mar, 2021 1 commit
msbaines authored
- 11 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* Adding a hard sync barrier before the broadcast, mostly useful for Gloo actually; NCCL is synced behind the scenes
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510
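The barrier-before-broadcast pattern is straightforward with torch.distributed; a minimal sketch (illustrative, not the commit's exact code):

```python
import torch
import torch.distributed as dist

def synced_broadcast(tensor: torch.Tensor, src: int = 0) -> None:
    """Broadcast preceded by an explicit barrier. With Gloo the barrier keeps
    stragglers from racing into the broadcast; NCCL already synchronizes
    collectives behind the scenes."""
    dist.barrier()
    dist.broadcast(tensor, src=src)
```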
- 09 Mar, 2021 1 commit
Benjamin Lefaudeux authored
- 05 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* change empty shard handling for OSS, do not rely on asserts
* code review
- 04 Mar, 2021 1 commit
Min Xu authored
- cover them in terms of code path only
- numerically, AdaScale is different on SDP/FSDP than DDP, mainly due to the partial view of the gradients
- this doesn't mean it is definitely not useful, but it is yet to be validated
- not going to spend too much time until we have a real use case
- 23 Feb, 2021 1 commit
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper. Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
  * all-gather parameters at start of forward pass and start of backward pass
  * reduce-scatter grads at end of backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
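A minimal usage sketch of the wrapper described above (the surrounding setup is illustrative; `reshard_after_forward` is the FSDP flag discussed in the notes):

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# assumes torch.distributed is initialized with one process per GPU
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()

# drop-in replacement for DDP; reshard_after_forward=True trades ~50% more
# communication (ZeRO-3 style) for a lower memory footprint
model = FSDP(model, reshard_after_forward=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()  # grads are reduce-scattered at the end of backward
optimizer.step()
```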
- 22 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* adding an assert + corresponding unit test
* updated changelog
* adjusting the adascale tests
- 19 Feb, 2021 1 commit
Min Xu authored
- 14 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* WIP, needs to be fixed!
* should be a fix, many thanks Weiyi Zheng
* slightly better unit test, sorting the states on the way out
* reproducing the issue from Weiyi in a unit test, and finally properly fixing it
* fixing unit test on pytorch1.5
  - original loss diff: 26.404895782470703 vs. 26.404342651367188
- 12 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests
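A sketch of the optional refresh mentioned above, assuming fairscale's ShardedDataParallel exposes it as `refresh_trainable()` (the setup and model are illustrative):

```python
import torch.nn as nn
from torch.optim import SGD
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as SDP

# assumes a torch.distributed process group is already initialized
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
optimizer = OSS(model.parameters(), optim=SGD, lr=0.1)
model = SDP(model, optimizer)

# freeze part of the model, then ask SDP to rebuild its bookkeeping
# instead of reusing the stale trainability assumptions
for p in model.module[0].parameters():
    p.requires_grad = False
model.refresh_trainable()
```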
- 05 Feb, 2021 1 commit
Benjamin Lefaudeux authored
fix a broken earlier commit, only worked for the first step
- 03 Feb, 2021 2 commits
Benjamin Lefaudeux authored
* precise skip, only if agent has only cpu
Min Xu authored
* [feat] Add AdaScaleWrapper
  - This enables a different API for wrapping an optimizer with AdaScale.
  - This also enables AdaScale to be wrapped by OSS.
  - However, OSS wrapping AdaScale results in different optimization behavior, and future research will be needed to study its effects.
  - testing: add unit tests.
* addressed comment: typo
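The two wrapping orders described above, sketched with fairscale's AdaScale API; the `AdaScaleWrapper` signature shown (optimizer class plus its kwargs) is an assumption about this commit's addition, not a confirmed interface:

```python
import torch.nn as nn
from torch.optim import SGD
from fairscale.optim import AdaScale, AdaScaleWrapper
from fairscale.optim.oss import OSS

model = nn.Linear(8, 2)

# classic API: AdaScale wraps an already-built optimizer instance
adascale_opt = AdaScale(SGD(model.parameters(), lr=0.1))

# wrapper API: AdaScaleWrapper constructs the optimizer itself, so it can be
# handed to OSS as the optimizer class (OSS wrapping AdaScale)
sharded_opt = OSS(model.parameters(), optim=AdaScaleWrapper, optim_cls=SGD, lr=0.1)
```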
- 02 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* adding a test to prove the interoperability with upstream pytorch
* updating the changelog
* eager state pruning
* pytorch 1.5 compat
- 29 Jan, 2021 1 commit
Min Xu authored
* [test]: test with py39 + torch 1.8 nightly
* version fix
* more fixes
* fix version function for nightly version
* fix torch_pg build
* invalidate cache
* separate benchmark requirements
* comment
* fixed mypy
* fixed a test
- 28 Jan, 2021 1 commit
Min Xu authored
* [test]: test adascale with oss
* minor fix
* add a small comment
* refactor: moved find_tensor_by_shape
* refactor: move test golden data into its own module
* refactor: simplified the train function
* refactor: added comments as suggested
- 27 Jan, 2021 1 commit
Benjamin Lefaudeux authored
- 20 Jan, 2021 1 commit
Benjamin Lefaudeux authored
- 11 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* tentatively fixing the cpu version of circleci jobs, now pipe tests are the last ones standing
* fixing oss backcompat, trying to fix rpc in old pytorch also
* fixing the file based init in torch 1.5
- 08 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* adding a parity unit test
* code review, better testing, use torch defaults and check for the loss, log world size