Commits · 429f3d318b6ca9b970568e19170af9c5d77af010 · OpenDAS / fairscale

23 Sep, 2022 1 commit

[fix] better handling non-flatten in FSDP (#1072) · 429f3d31

Min Xu authored Sep 23, 2022



* [fix] better handling non-flatten in FSDP

- see the detailed comment about that backward firing case
- also minor debugging help in FSDP
- also minor fix in FPW's state dict

* [feat] disallow reset_parameters by default

* [feat] adding fsdp_instances API - useful for checking wrapping in user code

* [fix] one line fix but more than a day of debugging

* fixed the case of loading a combined checkpoint with empty fsdp instances

* fixed another bug in state loading: full param caching in the root/nonroot module, caused by not resharding after forward

* [feat] support .half and .float better

* fixed a bug where gathering optim state loses extra keys from the original state_dict

* fixed a test failure in mixed precision

* fixed another bug affecting no_sync grad acc

* fixed a bug and a test in fsdp optim state

* fixed another corner case

* added a comment

* skip ssd offload tests

* skip fsdp one for ssd offload
Co-authored-by: Min Xu <min.xu.public@gmail.com>

429f3d31

13 Sep, 2022 1 commit

[bug] fix optim state gather when there is empty FSDP instances (#1071) · d8fc94d9

Min Xu authored Sep 13, 2022

* [bug] fix optim state gather when there is empty FSDP instances

* fixes an assert and a test bug

d8fc94d9

12 Jun, 2022 1 commit

Move f/utils => f/internal; move testing libs to fair_dev/testing (#1004) · 2350968e

Crutcher Dunnavant authored Jun 12, 2022

2350968e

30 Mar, 2022 1 commit

Remove sort_iseed_config and related dependencies. (#969) · 72f373c1

Paul Johnson authored Mar 30, 2022

This is no longer needed since isort's version is 5.10

Also fix black version to 22.3.0 to fix issue with click
dependency.

Update files that now fail with new version of black {a = 2 ** 4} ->
{a = 2**4}

72f373c1
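
The black change mentioned in #969 is purely a spacing change around the power operator. As a minimal illustration (the variable names here are made up for the example), both spellings evaluate to the same value, so only formatting checks are affected:

```python
# Spelling accepted by older black releases: spaces around the power operator.
a_old_style = 2 ** 4

# Spelling black 22.x produces for simple operands: no spaces.
a_new_style = 2**4

# The change is cosmetic only; both expressions evaluate to 16.
assert a_old_style == a_new_style == 16
```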

03 Mar, 2022 1 commit

[fix] FSDP: EMA related fixes (#922) · 9f347f37

Min Xu authored Mar 03, 2022



* add an ignore file

* [fix] FSDP: handle the lazy_init better

- when state_dict and load_state_dict are called, don't let them
  change the lazy_init state.

* changelog

* longer timeout

* Revert "longer timeout"

This reverts commit 00cc145fe86210a0972a1e7ba4f37531b9e091eb.

* testing

* adding the failed test

* fix the global to local id

* formatting

* more complete fix and test

* minor fix for an assert

* update changelog

* remove an extra line

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* addressed review comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

9f347f37

23 Feb, 2022 1 commit

[fix][FSDP] Add support for saving optimizer state with expert replication (#936) · 40e7450f

anj-s authored Feb 23, 2022

* checkpoint tests

* checkpoint tests

* fix tests

* lint fixes

* remove prints

* lint fixes

* add comments

* add changelog

* more cleanup

* lint fix

40e7450f

14 Feb, 2022 1 commit

[chore] [cleanup]: pytest, pytorch new versions, fix tests (#933) · fae29959

Min Xu authored Feb 14, 2022



* update pytest versions

* [test] test related changes

- upgrade to newer pytorch versions
- added function to make test more deterministic on A100 and TF32
- fixed some tests so that they are correctly skipped on a single GPU system

* more fixes

* formatting overly long lines

* format

* better test without triggering a warning

* fix an optim state bug with newer pytorch

- the adam optimizer seems to return "step" as a singleton tensor now in the
nightly build
- this fixes it, assuming a non-tensor value can still be loaded back by
the optimizer

* improve oss.py

- using min_loss for regression checking is a bit more reliable
- also increased the num epochs from 10 to 12

* small oss.py fix

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

fae29959
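
The optimizer-state fix in #933 concerns Adam returning "step" as a singleton tensor in a nightly PyTorch build. Below is a minimal sketch of that kind of normalization, assuming the goal is to store "step" as a plain Python number. `unwrap_singleton_step` is a hypothetical helper written for illustration, not fairscale's actual code, and it relies only on the value exposing an `.item()` method, as single-element tensors do.

```python
from typing import Any, Dict


def unwrap_singleton_step(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one parameter's optimizer state in which a
    tensor-like "step" entry is replaced by a plain Python number.

    Hypothetical helper for illustration only.
    """
    out = dict(state)  # shallow copy so the caller's dict is not mutated
    step = out.get("step")
    # Single-element tensors expose .item(); plain ints and floats do not.
    if step is not None and hasattr(step, "item"):
        out["step"] = step.item()
    return out


# Usage with an already-plain value: the state is returned unchanged.
plain_state = unwrap_singleton_step({"step": 5})
```

Keeping the original dict untouched is a deliberate choice in this sketch, so the optimizer's live state is not modified while a state dict is being assembled.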

12 Nov, 2021 1 commit

Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d

Anupam Bhatnagar authored Nov 11, 2021

* adding pre-commit files

* applying pre-commit to all files

* adding no-strict-optional argument to mypy in circle ci config

* fix typo

* updating python versions

* [skip ci] remove extra args

* adding python 3.9

* [skip ci] set pre-commit version in requirements-dev.txt

* set CACHE_VERSION

* move linters from circleci to github actions

* update python version

* update python version in benchmarks_2

* moving to python 3.9.7

7d7edf6d

06 Sep, 2021 1 commit

[cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744) · 3ecf76f4

Min Xu authored Sep 05, 2021

* changelog; mypy; oss cleanup

* more broadcast_object cleanup in FSDP

* one more mypy fix

* retire pytorch 1.6 from circleci, add new nightly, add 1.8 LTS and 1.9 stable release

* update torch version for LTS

* minor fixes

* update cache key

* trying newer gpu VMs

* bump the cache

* update to gpu.medium, which should be 2 GPUs

* update nightly version

* add pre-commit instruction

* fixed CHANGELOG after merging

* updated to newer nightly

* retained the older broadcast function for older GPUs for oss.py

* fixed a bug

* added a comment

* fixing a test for pytorch 1.10

* testing a fix

* Update fairscale/optim/oss.py

* Update CONTRIBUTING.md
Co-authored-by: Min Xu <min.xu.public@gmail.com>

3ecf76f4

08 May, 2021 1 commit

[chore] Rename and move utils.py from optim/ to utils/ (#669) · 5739930f

anj-s authored May 07, 2021

* rename and move optim/utils.py

* attach the new file

5739930f

13 Apr, 2021 1 commit

[FSDP] use all_gather for 10X OSD consolidation speedup (#595) · a82825db

Sam Shleifer authored Apr 13, 2021

a82825db

08 Apr, 2021 1 commit

[fix] [FSDP] optim state dict should be completely on CPU (#590) · a6549be7

Sam Shleifer authored Apr 08, 2021

a6549be7

04 Apr, 2021 1 commit

[FSDP] add no_broadcast_optim_state option (#560) · 1fcbd624

Sam Shleifer authored Apr 04, 2021

1fcbd624

25 Mar, 2021 1 commit

[FSDP][feature] optimizer state dict save and load (#537) · 9474d75d

Sam Shleifer authored Mar 25, 2021

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

9474d75d