Commits · 5f484b3545f27eddb19d970fbe1d361b9c5f2b07 · OpenDAS / fairscale

24 Sep, 2022 1 commit

[Fix][FSDP] Don't remove post backward hooks for multiple backward fix (#1079) · f4fcee7e

Min Xu authored Sep 24, 2022



* tmp

* test again

* test again

* add new test

* clean up

* add test file to the testlist

* more comments

* add changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>

f4fcee7e

02 May, 2022 1 commit

[FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974) · 51b53ddb

Paul Johnson authored May 02, 2022

* [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data

* Get ssd_offload unit tests passing

* [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on

* Update changelog

51b53ddb

06 Apr, 2022 1 commit

Improvements to ssd_offload to support pickling/unpickling SsdTensorHandle (and derived classes) (#964) · 92f27daa

Paul Johnson authored Apr 06, 2022

Verified that FSDP wrapped models using ssd_offload checkpoint save and restore correctly

92f27daa

09 Mar, 2022 1 commit

[chore] 0.4.6 release (#953) · 3e36cd07

tmarkstrum authored Mar 09, 2022

* [chore] 0.4.6 release

* added the third party libs removed by precommit

3e36cd07

08 Mar, 2022 1 commit

[chore] Fix copyright headers & fixed issue with mypy & NumPy versions in pre-commit (#951) · 8fa26ae4

Min Xu authored Mar 08, 2022



* copyright headers

* isort and pyproject.toml

* precommit and requirement for isort-seed-config

* mypy

* dummy change

* numpy version for pre-commit

* fix mypy issue caused by numpy
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8fa26ae4

03 Mar, 2022 1 commit

[fix] FSDP: EMA related fixes (#922) · 9f347f37

Min Xu authored Mar 03, 2022



* add an ignore file

* [fix] FSDP: handle the lazy_init better

- when state_dict and load_state_dict are called, they should not change
  the lazy_init state.

* changelog

* longer timeout

* Revert "longer timeout"

This reverts commit 00cc145fe86210a0972a1e7ba4f37531b9e091eb.

* testing

* adding the failed test

* fix the global to local id

* formatting

* more complete fix and test

* minor fix for an assert

* update changelog

* remove an extra line

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* addressed review comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

9f347f37

23 Feb, 2022 1 commit

[fix][FSDP] Add support for saving optimizer state with expert replication (#936) · 40e7450f

anj-s authored Feb 23, 2022

* checkpoint tests

* checkpoint tests

* fix tests

* lint fixes

* remove prints

* lint fixes

* add comments

* add changelog

* more cleanup

* lint fix

40e7450f

15 Feb, 2022 1 commit

Update CHANGELOG.md (#935) · 9090bfdc

ruanslv authored Feb 15, 2022

* Update CHANGELOG.md

Adding https://github.com/facebookresearch/fairscale/pull/930 to changelog

* Update CHANGELOG.md
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

9090bfdc

28 Jan, 2022 1 commit

[feat] add CosFace paper's LMCL to MEVO (#916) · 89e1ae5f

Min Xu authored Jan 27, 2022



* [feat] add CosFace paper's LMCL to MEVO

- added baseline algorithm to the reference kernel
- added MEVO version of LMCL
- added unit test to verify it is correct with respect to the reference as well as its memory usage

* updated changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>

89e1ae5f

14 Jan, 2022 1 commit

[Chore]release 0.4.5 (#911) · 4a3bd93a

tmarkstrum authored Jan 14, 2022

* release 0.4.5

* added some content for the release

* fixed a format issue.

4a3bd93a

13 Jan, 2022 2 commits

[feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2

Anupam Bhatnagar authored Jan 12, 2022

* [skip ci] first commit

* [skip ci] gradient scaler example

* [skip ci] adding feed forward toy example

* [skip ci] adding types

* [skip ci] adding backward hook

* [skip ci] update

* [skip ci] working feed forward example

* [skip ci] working feed forward example

* [skip ci] use named_modules instead of named_children

* [skip ci] adding new file

* [skip ci] clean up

* [skip ci] implement unscale function

* [skip ci] implement unscale function

* [skip ci] removing old file

* [skip ci] removing some more old files

* [skip ci] making unscale function generic

* [skip ci] adding test for vision model

* [skip ci] adding identity layer

* [skip ci] cleanup files

* [skip ci] refactoring

* [skip ci] more refactoring

* [skip ci] added functionality to update scale

* [skip ci] data loader clean up

* [skip ci] implemented inf checks and update scale functions

* [skip ci]code clean up. added...

52d066a2

[Fix][FSDP]fixed padding size of input tensor for reduce scatter (#907) · fb4eca19

tmarkstrum authored Jan 12, 2022



* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* added changelog

* fixed some commit.

* added unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.

* throw an error instead of rolling back to use default process group for reduce_scatter_process_group

* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"

This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.

* added check for None to avoid unit test failure

* fixed an error to avoid the unit tests failure
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

fb4eca19
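
The padding fix in the entry above concerns making a flattened input evenly divisible across ranks before a reduce-scatter. The following is a minimal sketch of that arithmetic in plain Python, not the code from the commit. The helper names `padded_numel`, `pad_flat`, and `chunk` are hypothetical and do not come from fairscale, and lists stand in for tensors.

```python
from typing import List


def padded_numel(numel: int, world_size: int) -> int:
    """Return the smallest multiple of world_size that is >= numel."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    return ((numel + world_size - 1) // world_size) * world_size


def pad_flat(values: List[float], world_size: int) -> List[float]:
    """Right-pad a flat list with zeros so it splits into equal shards."""
    target = padded_numel(len(values), world_size)
    return values + [0.0] * (target - len(values))


def chunk(values: List[float], world_size: int) -> List[List[float]]:
    """Split the padded list into world_size shards of equal length."""
    padded = pad_flat(values, world_size)
    shard = len(padded) // world_size
    return [padded[i * shard:(i + 1) * shard] for i in range(world_size)]
```

The point of the sketch is that the padding amount depends on the size of the group actually used for the reduce-scatter, which is presumably why the same entry also checks that group's size.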

12 Jan, 2022 1 commit

[chore] Update the CHANGELOG to add details about the new feature that enables reduce_scatter overlap in backward propagation (#906) · 0044372c

tmarkstrum authored Jan 11, 2022

* updated the change log

* improve the change log

0044372c

06 Jan, 2022 1 commit

FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb

four4fish authored Jan 06, 2022

* FullyShardedDataParallel: only return full state dict on rank 0

* Add flag and make rank 0 only optional

* Add tests

* Add docs

* address comments

* update comments

* update torch nightly version

* update torchvision number for torch nightly dependence

* add changelog

* Update CHANGELOG.md

* Update CHANGELOG.md

d3417ceb
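
The entry above describes returning the full state dict only on rank 0, controlled by an optional flag. A minimal sketch of that behaviour follows, assuming non-zero ranks receive an empty dict. The function name `full_state_dict` and the flag name `rank_0_only` are hypothetical and are not taken from the fairscale API.

```python
from typing import Dict


def full_state_dict(
    gathered: Dict[str, float], rank: int, rank_0_only: bool = False
) -> Dict[str, float]:
    """Return the gathered state on every rank, or only on rank 0.

    `gathered` stands in for the already-assembled full state; the real
    FSDP code would build it from sharded parameters.
    """
    if rank_0_only and rank != 0:
        # Assumption: other ranks get an empty dict, so no copy is kept.
        return {}
    return dict(gathered)
```

Keeping the flag off by default preserves the earlier behaviour, where every rank receives the full state.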

05 Jan, 2022 1 commit

Enabling ssd_offload training basic tests. (#887) · c5e471bc

Paul Johnson authored Jan 05, 2022

* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments.
* Update Changelog

c5e471bc

21 Dec, 2021 3 commits

[skip ci] updating date in changelog (#892) · 8397f766

Anupam Bhatnagar authored Dec 21, 2021

8397f766

Changelog update (#891) · 8e770bac

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] adding comments to changelog

* adding date to changelog

* [skip ci] minor edit

8e770bac

[Fix] - Finiteness check for all tensors (#890) · c3fc3894

Anupam Bhatnagar authored Dec 21, 2021

* Finiteness check for all tensors

* [skip ci] updating changelog

c3fc3894

02 Dec, 2021 1 commit

[fix] [FSDP] Do not lose original reshard_after_forward (#880) · 7c2c3e00

Min Xu authored Dec 02, 2021

* [fix] [FSDP] Do not lose original reshard_after_forward

- In a corner case we can lose this value
- Saving it and using it in the reset function fixed it
- A trivial case probably not worth a dedicated test for now

* added changelog

7c2c3e00
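
The fix described in the entry above follows a common pattern: keep a copy of the user's original setting and restore from that copy on reset. A minimal sketch follows, using a hypothetical `ShardedWrapper` class in place of the real FSDP wrapper. The method names are illustrative only.

```python
class ShardedWrapper:
    """Hypothetical stand-in for a wrapper with a mutable resharding flag."""

    def __init__(self, reshard_after_forward: bool = True) -> None:
        self.reshard_after_forward = reshard_after_forward
        # Keep the user's original choice so later overrides can be undone.
        self._orig_reshard_after_forward = reshard_after_forward

    def override_reshard(self, value: bool) -> None:
        # Some internal code path changes the flag temporarily.
        self.reshard_after_forward = value

    def reset(self) -> None:
        # Restore from the saved copy rather than a hard-coded default.
        self.reshard_after_forward = self._orig_reshard_after_forward
```

Without the saved copy, a reset that assumed a default would silently drop a non-default value chosen at construction time.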

18 Nov, 2021 2 commits

[chore] 0.4.3 release (#860) · 68d10f73

Min Xu authored Nov 18, 2021



* [chore] 0.4.3 release

* update setup.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

68d10f73

[fix] [MEVO]: make mevo work with eval and optim_state checkpointing (#851) · 0db50ce5

Min Xu authored Nov 18, 2021



* [fix]: fix eval for shared weight FSDP

* fixing optim state saving

* add changelog

* reformat with newer local isort

* update test

* avoid computing reference state unless we are testing training

* added optim_state test

* make mypy happy

* move tests; maybe we need to put CUDA memory related tests first in the lists
Co-authored-by: Min Xu <min.xu.public@gmail.com>

0db50ce5

17 Nov, 2021 2 commits

[feature] Add a OffloadConfig object to specify offloading params to disk. (#855) · ef194cd2

anj-s authored Nov 17, 2021

* fixed lint issues

* remove unused print statements

* add changelog entry

* [skip ci] fix lint errors

ef194cd2

Update changelog, removed meta.yml and requirements cleanup (#853) · 2bfa5a61

Anupam Bhatnagar authored Nov 17, 2021

* update changelog

* [skip ci] removed requirements-test.txt

* [skip ci] updating changelog

* [skip ci] add PR numbers

* replacing requirements-test.txt by requirements-dev.txt

* [skip ci] changing requirements-test to requirements-dev in pre-commit and requirements-benchmarks

* [skip ci] mark manual static analysis checks as deprecated

* empty commit to trigger ci

* [skip ci] updating changelog

* [skip ci] addressing comments

* addressing more comments

2bfa5a61

12 Nov, 2021 1 commit

Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d

Anupam Bhatnagar authored Nov 11, 2021

* adding pre-commit files

* applying pre-commit to all files

* adding no-strict-optional argument to mypy in circle ci config

* fix typo

* updating python versions

* [skip ci] remove extra args

* adding python 3.9

* [skip ci] set pre-commit version in requirements-dev.txt

* set CACHE_VERSION

* move linters from circleci to github actions

* update python version

* update python version in benchmarks_2

* moving to python 3.9.7

7d7edf6d

08 Nov, 2021 3 commits

[chore] 0.4.2 release (#846) · b65ce6ff

Anupam Bhatnagar authored Nov 08, 2021

* [chore] 0.4.2 release

* updating torch version

* [skip ci] updating readme and requirements.txt

b65ce6ff

[feature]Add support for SSD offload with FSDP for eval workloads (#839) · d7c4aa52

anj-s authored Nov 08, 2021

* update release notes

* initial commit

* lint cleanup etc.

* helper functions; lint errors

* lint errors

* lint errors

* add back the boolean for named_parameters

* address comments and fix lint

* remove unused functions and class

* remove unused state

d7c4aa52

[feat] Gossip/SlowMo (#378) · 21464e05

Benjamin Lefaudeux authored Nov 08, 2021



Add SlowMo Distributed Data Parallel for clusters with slow interconnects
Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>

21464e05

05 Nov, 2021 1 commit

[feat] experimental MEVO layer (#840) · 8347c1a2

Min Xu authored Nov 05, 2021



* [feat] MEVO kernel

- initial import from min/softmax and min/testing branches
- need to rename and further cleanup

* only test with newer pytorch

* renamed and added comments and code cleanup

* rename and reduce test memory

* testing

* minor fixing

* fixing

* more fix

* changelog

* more 1.7 and 1.8 paper cuts

* remove dead code

* addressed Benjamin's comments

* addressed more comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8347c1a2

01 Nov, 2021 1 commit

[feat] [FSDP]: add experimental support to shared weights (#836) · f2af4c66

Min Xu authored Nov 01, 2021



* added a new test, passing without shared weights

* tested weight sharing

* added the test to test list file

* extended to world_size = 2

* fixed test

* [feat]: add limited and experimental support for shared parameter

* fixed tests

* simplify to work with layer with at least 1 non-shared params and add code to pick up linked_param field for sharding the shared param

* fixed the case where linked param is not in separate FSDP

* changelog and remove old code
Co-authored-by: Min Xu <min.xu.public@gmail.com>

f2af4c66

27 Oct, 2021 1 commit

[fix]: Fixes an issue with pre_backward hook registering (#833) · 5da5c0eb

Min Xu authored Oct 27, 2021



* added the failing test

* fixed the bug

* fine-tune the condition

* typo

* typo

* changelog and added test to test files
Co-authored-by: Min Xu <min.xu.public@gmail.com>

5da5c0eb

20 Oct, 2021 1 commit

[chore] Add log for the new experimental memory tracker feature. (#819) · ce2ad89e

anj-s authored Oct 20, 2021

* add log for new memory tracker features

* add log for new memory tracker features

ce2ad89e

20 Sep, 2021 1 commit

[chore]0.4.1 release (#803) · 1b9be421

tmarkstrum authored Sep 20, 2021

* [chore]0.4.1 release

* put more details in one change log

1b9be421

13 Sep, 2021 1 commit

[OSS] Fixing the fp16 broadcast and catching this case in the unit test (#795) · 180ab8c8

Benjamin Lefaudeux authored Sep 13, 2021

180ab8c8

12 Sep, 2021 1 commit

[fix] minor fixes for master branch (#792) · 31e36453

Min Xu authored Sep 12, 2021



* add changelog for previous commit

* add changelog for previous commit

* add changelog for previous commit

* fix a merge induced error
Co-authored-by: Min Xu <min.xu.public@gmail.com>

31e36453

05 Sep, 2021 1 commit

[fix] [FSDP] making sure we use full params for multiple backwards within an iteration (#775) · 95d31d4d

Min Xu authored Sep 05, 2021



* [bug] [FSDP] making sure we use full params for multiple backwards within an iteration

* changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>

95d31d4d

12 Aug, 2021 2 commits

add changelog for PRs submitted (#764) · d54e183c

anj-s authored Aug 12, 2021

d54e183c

[minor] RELEASE.md and pre-commit (#762) · f2852ad7

Min Xu authored Aug 12, 2021

* minor: changelog and pre-commit

* addressed comment

* update the release doc
Co-authored-by: Min Xu <min.xu.public@gmail.com>

f2852ad7

01 Aug, 2021 1 commit

[chore] 0.4.0 release (#757) · 3e661603

Min Xu authored Jul 31, 2021

Co-authored-by: Min Xu <min.xu.public@gmail.com>

3e661603

31 Jul, 2021 1 commit

FSDP: supporting gradient accumulation without no_sync context manager to save GPU memory (#752) · cd0f0b88

Myle Ott authored Jul 31, 2021



* Add test (broken) for gradient accumulation without no_sync context manager

* changelog

* no_sync to grad_acc renaming for tests

* clean up tmp files

* support grad acc without no_sync

* minor

* update changelog

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Better assertion from Sam.
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* lint
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

cd0f0b88

27 Jul, 2021 1 commit

[chore] 0.3.9 release (#750) · 61ece000

Min Xu authored Jul 27, 2021



* [chore] 0.3.9 release

* update changelog

* address comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>

61ece000