Commits · 52d066a2480807d0aba4b8025c0b17aa21dedfff · OpenDAS / fairscale

13 Jan, 2022 2 commits

[feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2

Anupam Bhatnagar authored Jan 12, 2022

* [skip ci] first commit

* [skip ci] gradient scaler example

* [skip ci] adding feed forward toy example

* [skip ci] adding types

* [skip ci] adding backward hook

* [skip ci] update

* [skip ci] working feed forward example

* [skip ci] working feed forward example

* [skip ci] use named_modules instead of named_children

* [skip ci] adding new file

* [skip ci] clean up

* [skip ci] implement unscale function

* [skip ci] implement unscale function

* [skip ci] removing old file

* [skip ci] removing some more old files

* [skip ci] making unscale function generic

* [skip ci] adding test for vision model

* [skip ci] adding identity layer

* [skip ci] cleanup files

* [skip ci] refactoring

* [skip ci] more refactoring

* [skip ci] added functionality to update scale

* [skip ci] data loader clean up

* [skip ci] implemented inf checks and update scale functions

* [skip ci] code clean up. added...

52d066a2

[Fix][FSDP]fixed padding size of input tensor for reduce scatter (#907) · fb4eca19

tmarkstrum authored Jan 12, 2022



* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* added changelog

* fixed some commit.

* added unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.

* throw an error instead of rolling back to use default process group for reduce_scatter_process_group

* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"

This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.

* added check for None to avoid unit test failure

* fixed an error to avoid the unit tests failure
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

fb4eca19

12 Jan, 2022 1 commit

[chore] Update the CHANGELOG to add details about the new feature that enables... · 0044372c

tmarkstrum authored Jan 11, 2022

[chore] Update the CHANGELOG to add details about the new feature that enables reduce_scatter overlap in backward propagation (#906)

* updated the change log

* improve the change log

0044372c

07 Jan, 2022 1 commit

[FSDP] Enable FSDP reduce scatter overlap (#897) · 0a526bcb

tmarkstrum authored Jan 07, 2022

* enable reduce scatter overlap with other operations

* fixed unit tests and added docstrings for the new parameters for fsdp

* fixed more unit tests

* fixed unit tests

* avoided the pickle error on process_group_reduce_scatter

* removed an unnecessary parameter in unit tests

* remove unnecessary prints

* fixed the docstring

* skipped the test_offload unit test because this unit test failed in the main branch

* removed the enable_reduce_scatter_overlap API parameter

* added doc string for the default value of process_group_reduce_scatter parameter

* fixed a syntax bug

* fixed a bug which caused unit test failure

* removed the all_gather in the ProcessGroupName enum

* added more comment

* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter

0a526bcb

06 Jan, 2022 2 commits

fix trailing space issue (#903) · 02a8913c
tmarkstrum authored Jan 06, 2022

02a8913c

FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb

four4fish authored Jan 06, 2022

* FullyShardedDataParallel: only return full state dict on rank 0

* Add flag and make rank 0 only optional

* Add tests

* Add docs

* address comments

* update comments

* update torch nightly version

* update torchvision number for torch nightly dependence

* add changelog

* Update CHANGELOG.md

* Update CHANGELOG.md

d3417ceb

05 Jan, 2022 1 commit

Enabling ssd_offload training basic tests. (#887) · c5e471bc

Paul Johnson authored Jan 05, 2022

* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments.
* Update Changelog

c5e471bc

24 Dec, 2021 1 commit

[skip ci] update release.md (#896) · 541bb8c9

Anupam Bhatnagar authored Dec 23, 2021

* [skip ci] update release.md

* [skip ci] minor edit

541bb8c9

21 Dec, 2021 5 commits

0.4.4 release · 38af6d32
Anupam Bhatnagar authored Dec 21, 2021

38af6d32
[skip ci] updating date in changelog (#892) · 8397f766
Anupam Bhatnagar authored Dec 21, 2021

8397f766

Changelog update (#891) · 8e770bac

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] adding comments to changelog

* adding date to changelog

* [skip ci] minor edit

8e770bac

[Fix] - Finiteness check for all tensors (#890) · c3fc3894

Anupam Bhatnagar authored Dec 21, 2021

* Finiteness check for all tensors

* [skip ci] updating changelog

c3fc3894

Release automation (#888) · 49eacf12

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] first commit to automate release process

* empty commit

* fix syntax

* fix next_version value

* fixing more syntax

* remove uses

* fix

* fixed path in setup.py

* trying a basic example

* adding branch

* change release to name

* adding first step

* remove push trigger

* change order in ON section

* modifying manual workflow

* adding fairscale release workflow

* removing unused workflows

* replacing values with secrets

* fixing __version__ in __init__.py

* cleanup

* restoring import statement

49eacf12

16 Dec, 2021 1 commit

Added warn_on_trainable_params_changed constructor parameter to allow the user... · 99163d4f

Freddy Snijder authored Dec 16, 2021

Added warn_on_trainable_params_changed constructor parameter to allow the user to suppress the warning on trainable parameters changed (#886)

* Added warn_on_trainable_params_changed constructor parameter to allow the user to suppress the warning on trainable parameters changed; the default is True and thus the default behavior is unchanged

* Added parameter documentation

99163d4f

13 Dec, 2021 1 commit

[feat] support eval in mevo (#884) · 56add6d5

Min Xu authored Dec 13, 2021

- During eval, we fall back to just the output projection without fusing
- added unit test to ensure the shape is correct

56add6d5

06 Dec, 2021 1 commit

Fix for Key Error that can happen in certain FSDP wrapping scenarios of... · e6acdcc3

Freddy Snijder authored Dec 06, 2021

Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881)

* Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876)

* Styling fixes

* Updated the test to be independent of the Huggingface transformers package

* Added test for issue #876

* Small error message fix

* Skip test when CUDA is not available

* Fixed naming of model

e6acdcc3

02 Dec, 2021 5 commits

[fix] [FSDP] Do not lose original reshard_after_forward (#880) · 7c2c3e00

Min Xu authored Dec 02, 2021

* [fix] [FSDP] Do not lose original reshard_after_forward

- In a corner case we can lose this value
- Saving it and using it in the reset function fixes it
- A trivial case, probably not worth a dedicated test for now

* added changelog

7c2c3e00

Update bug-report.md · 1eccb92d
Min Xu authored Dec 02, 2021

1eccb92d

Update feature-request.md · f177f80c
Min Xu authored Dec 02, 2021

f177f80c

Update questions-help-support.md · 684e6aed
Min Xu authored Dec 02, 2021

684e6aed

Update questions-help-support.md · 451a1fe3
Min Xu authored Dec 02, 2021

451a1fe3

29 Nov, 2021 1 commit

Add PyTorch version in README (#877) · f5c719b2
Anupam Bhatnagar authored Nov 29, 2021

f5c719b2

24 Nov, 2021 2 commits

[benchmarks]Add an MOE benchmark (#866) · 56254247

Ying Zhang authored Nov 24, 2021

* Add MOE to lm benchmarks

* linter

* Fix source / target

* address comments

* address comments

* address comments

* add circleci

* fix circleci

* precommit

56254247

[chore]Update README to specify the exact PyTorch version we are testing with. (#870) · 73187df0

anj-s authored Nov 23, 2021

* Update README to specify the exact PyTorch version we are testing with.

* update to 1.10.0 in the README

73187df0

21 Nov, 2021 1 commit

Update README.md · b724a77e
anj-s authored Nov 21, 2021

b724a77e

19 Nov, 2021 1 commit

Add installation instructions through conda (#863) · 117fc8bd

h-vetinari authored Nov 20, 2021

* DOC: fix the rst-headers in installation instructions

* DOC: add installation through conda-forge to instructions

* DOC: fix rst-syntax in installation-instructions

* DOC: add comment about building from source with GPU-support

117fc8bd

18 Nov, 2021 4 commits

remove no-commit-to-branch hook (#861) · 824022be
Anupam Bhatnagar authored Nov 18, 2021

824022be

[chore] 0.4.3 release (#860) · 68d10f73

Min Xu authored Nov 18, 2021



* [chore] 0.4.3 release

* update setup.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

68d10f73

[fix] [MEVO]: make mevo work with eval and optim_state checkpointing (#851) · 0db50ce5

Min Xu authored Nov 18, 2021



* [fix]: fix eval for shared weight FSDP

* fixing optim state saving

* add changelog

* reformat with newer local isort

* update test

* avoid computing reference state unless we are testing training

* added optim_state test

* make mypy happy

* move tests; maybe we need to put CUDA memory related tests first in the lists
Co-authored-by: Min Xu <min.xu.public@gmail.com>

0db50ce5

[POC] Testing Manual dispatch (#859) · fd831c4a

Anupam Bhatnagar authored Nov 18, 2021

* adding a manual workflow

* add push

* fix syntax

* adding a new workflow

* renaming file

* cleanup yaml files

* [skip ci] removing pyproject edits

fd831c4a

17 Nov, 2021 2 commits

[feature] Add a OffloadConfig object to specify offloading params to disk. (#855) · ef194cd2

anj-s authored Nov 17, 2021

* fixed lint issues

* remove unused print statements

* add changelog entry

* [skip ci] fix lint errors

ef194cd2

Update changelog, removed meta.yml and requirements cleanup (#853) · 2bfa5a61

Anupam Bhatnagar authored Nov 17, 2021

* update changelog

* [skip ci] removed requirements-test.txt

* [skip ci] updating changelog

* [skip ci] add PR numbers

* replacing requirements-test.txt by requirements-dev.txt

* [skip ci] changing requirements-test to requirements-dev in pre-commit and requirements-benchmarks

* [skip ci] mark manual static analysis checks as deprecated

* empty commit to trigger ci

* [skip ci] updating changelog

* [skip ci] addressing comments

* addressing more comments

2bfa5a61

15 Nov, 2021 1 commit

Allow sharded grad scaler to cpu offload with FSDP (#831) · ba5785f7

Anupam Bhatnagar authored Nov 15, 2021

* first commit

* sharded scaler hitting nan assertions

* adding test for sharded grad scaler without cpu offload

* ddp grad scaler and fsdp sharded grad scaler test failing

* removing test_output

* fix no cpu offload test

* changing optimizer from OSS to SGD

* all tests passing, code cleanup pending

* code cleanup

* fix pyproject.toml

* removing .isort.cfg

* running isort linter

* resolving isort issues

* resolving black linter issue

* resolving mypy issues

* fix import statement

* fix mypy error

* modifying import statement

* adding pytorch version requirement

* fixing pytest skip test decorator

* apply version guard for ShardedGradScaler

* removing test_fsdp_grad_scaler

* increasing num_epochs for ShardedGradScaler so that updates are not skipped

* adding support for torch 1.8

* minor edit

* [skip ci] more torch 1.8 changes

* parametrizing the tests

* cleanup code with linters

* [skip ci] update doc string

* [skip ci] addressing some more comments

ba5785f7

12 Nov, 2021 1 commit

Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d

Anupam Bhatnagar authored Nov 11, 2021

* adding pre-commit files

* applying pre-commit to all files

* adding no-strict-optional argument to mypy in circle ci config

* fix typo

* updating python versions

* [skip ci] remove extra args

* adding python 3.9

* [skip ci] set pre-commit version in requirements-dev.txt

* set CACHE_VERSION

* move linters from circleci to github actions

* update python version

* update python version in benchmarks_2

* moving to python 3.9.7

7d7edf6d

09 Nov, 2021 1 commit

CI config changes (#847) · 6f3931a4

Anupam Bhatnagar authored Nov 08, 2021

* CI config changes

* changing params for failing tests

* [skip ci] minor edit

6f3931a4

08 Nov, 2021 3 commits

[chore] 0.4.2 release (#846) · b65ce6ff

Anupam Bhatnagar authored Nov 08, 2021

* [chore] 0.4.2 release

* updating torch version

* [skip ci] updating readme and requirements.txt

b65ce6ff

[feature]Add support for SSD offload with FSDP for eval workloads (#839) · d7c4aa52

anj-s authored Nov 08, 2021

* update release notes

* initial commit

* lint cleanup etc.

* helper functions; lint errors

* lint errors

* lint errors

* add back the boolean for named_parameters

* address comments and fix lint

* remove unused functions and class

* remove unused state

d7c4aa52

[feat] Gossip/SlowMo (#378) · 21464e05

Benjamin Lefaudeux authored Nov 08, 2021



Add SlowMo Distributed Data Parallel for clusters with slow interconnects
Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>

21464e05

05 Nov, 2021 1 commit

[feat] experimental MEVO layer (#840) · 8347c1a2

Min Xu authored Nov 05, 2021



* [feat] MEVO kernel

- initial import from min/softmax and min/testing branches
- need to rename and further cleanup

* only test with newer pytorch

* renamed and added comments and code cleanup

* rename and reduce test memory

* testing

* minor fixing

* fixing

* more fix

* changelog

* more 1.7 and 1.8 paper cuts

* remove dead code

* addressed Benjamin's comments

* addressed more comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8347c1a2

03 Nov, 2021 1 commit

Update Sphinx version in docs requirements file (#841) · f327eb4a
Vinayak Tantia authored Nov 03, 2021

f327eb4a