Commits · 67bf5bf8df7ccc8bd0300e11fb9aef31479af13e · OpenDAS / fairscale

08 Feb, 2022 2 commits

[FSDP] Add an arg for FSDP __init__ (#926) · 67bf5bf8

foreveronehundred authored Feb 09, 2022

* [FSDP] Add an arg for FSDP __init__

Add an arg, disable_reshard_on_root, for FSDP __init__ to handle the following issue
https://github.com/facebookresearch/fairscale/issues/878


In some cases (models wrapped by autowrap), the parameters of root modules need to be sharded, so reshard_after_forward should not be forced to False.
"disable_reshard_on_root" lets users choose whether to force reshard_after_forward of root modules to be False.

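The effect of the flag described above can be sketched in plain Python. This is an illustrative model of the decision only, not fairscale's code: the function name `effective_reshard_after_forward` and the `is_root` argument are hypothetical, and only `reshard_after_forward` and `disable_reshard_on_root` come from the commit message.

```python
def effective_reshard_after_forward(
    is_root: bool,
    reshard_after_forward: bool,
    disable_reshard_on_root: bool,
) -> bool:
    """Model which reshard setting a wrapper ends up using (hypothetical helper)."""
    if is_root and disable_reshard_on_root:
        # Root wrapper with the flag on: resharding after forward is forced off.
        return False
    # Otherwise the value the user passed is kept as-is.
    return reshard_after_forward


# A root wrapper keeps the user's choice only when the flag is turned off.
print(effective_reshard_after_forward(True, True, True))   # forced off
print(effective_reshard_after_forward(True, True, False))  # user's value kept
```

Under this model, non-root wrappers are never affected by the flag.
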
* Update fully_sharded_data_parallel.py

Modified the description of the feature to explain it more clearly.

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Update the comments for disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* Modified the comments

Modified the comments of disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

[chore] Fix docs build by updating the numpy intersphinx mapping (#929) · 7202115e

anj-s authored Feb 08, 2022

* update intersphinx mapping for numpy

* update intersphinx mapping for numpy

* update pytorch mapping and disable test

28 Jan, 2022 1 commit

[feat] add CosFace paper's LMCL to MEVO (#916) · 89e1ae5f

Min Xu authored Jan 27, 2022



* [feat] add CosFace paper's LMCL to MEVO

- added baseline algorithm to the reference kernel
- added MEVO version of LMCL
- added unit test to verify it is correct with respect to the reference as well as its memory usage

* updated changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>

25 Jan, 2022 2 commits
- [minor] make backward assert a bit better (#919) · 8ba649e1
  Min Xu authored Jan 25, 2022
```
* [minor] better assert in backward

* mypy
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
- [fix] reduce unit test memory and workaround the flakiness of the test (#917) · 5d8a505c
  Min Xu authored Jan 25, 2022
```
* [fix] reduce unit test memory

* set seed in CI

* fix random seed function

* giving up CI, //sigh
```
20 Jan, 2022 1 commit
- [FSDP] Add FairScale FSDP adoptions logging (#913) · 6f18e779
  Yanli Zhao authored Jan 20, 2022
```
* Add FairScale FSDP adoptions logging

* Add FairScale FSDP adoptions logging
```
18 Jan, 2022 1 commit
- FSDP: better traceback for dtype assertion (#912) · fef44233
  Sam Shleifer authored Jan 17, 2022
  
14 Jan, 2022 3 commits
- 0.4.5 release · 6b2f992c
  Anupam Bhatnagar authored Jan 14, 2022
  
- [Chore]release 0.4.5 (#911) · 4a3bd93a
  tmarkstrum authored Jan 14, 2022
```
* release 0.4.5

* added some content for the release

* fixed a format issue.
```
- small fixes to layerwise gradient scaler (#910) · 10d21b38
  Anupam Bhatnagar authored Jan 14, 2022
  
13 Jan, 2022 3 commits

[skip ci] fixing typos · 39e7821a
Anupam Bhatnagar authored Jan 13, 2022

[feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2

Anupam Bhatnagar authored Jan 12, 2022

* [skip ci] first commit

* [skip ci] gradient scaler example

* [skip ci] adding feed forward toy example

* [skip ci] adding types

* [skip ci] adding backward hook

* [skip ci] update

* [skip ci] working feed forward example

* [skip ci] working feed forward example

* [skip ci] use named_modules instead of named_children

* [skip ci] adding new file

* [skip ci] clean up

* [skip ci] implement unscale function

* [skip ci] implement unscale function

* [skip ci] removing old file

* [skip ci] removing some more old files

* [skip ci] making unscale function generic

* [skip ci] adding test for vision model

* [skip ci] adding identity layer

* [skip ci] cleanup files

* [skip ci] refactoring

* [skip ci] more refactoring

* [skip ci] added functionality to update scale

* [skip ci] data loader clean up

* [skip ci] implemented inf checks and update scale functions

* [skip ci]code clean up. added...

[Fix][FSDP]fixed padding size of input tensor for reduce scatter (#907) · fb4eca19

tmarkstrum authored Jan 12, 2022



* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* added changelog

* fixed some commit.

* added unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.

* throw an error instead of rolling back to use default process group for reduce_scatter_process_group

* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"

This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.

* added check for None to avoid unit test failure

* fixed an error to avoid the unit tests failure
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

12 Jan, 2022 1 commit

[chore] Update the CHANGELOG to add details about the new feature that enables... · 0044372c

tmarkstrum authored Jan 11, 2022

[chore] Update the CHANGELOG to add details about the new feature that enables reduce_scatter overlap in backward propagation (#906)

* updated the change log

* improve the change log

07 Jan, 2022 1 commit

[FSDP] Enable FSDP reduce scatter overlap (#897) · 0a526bcb

tmarkstrum authored Jan 07, 2022

* enable reduce scatter overlap with other operations

* fixed unit tests and added docstrings for the new parameters for fsdp

* fixed more unit tests

* fixed unit tests

* avoided the pickle error on process_group_reduce_scatter

* removed an unnecessary parameter in unit tests

* remove unnecessary prints

* fixed the docstring

* skipped the test_offload unit test because this unit test failed in the main branch

* removed the enable_reduce_scatter_overlap API parameter

* added docstring for the default value of process_group_reduce_scatter parameter

* fixed a syntax bug

* fixed a bug which caused unit test failure

* removed the all_gather in the ProcessGroupName enum

* added more comment

* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter

06 Jan, 2022 2 commits

fix trailing space issue (#903) · 02a8913c
tmarkstrum authored Jan 06, 2022

FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb

four4fish authored Jan 06, 2022

* FullyShardedDataParallel: only return full state dict on rank 0

* Add flag and make rank 0 only optional

* Add tests

* Add docs

* address comments

* update comments

* update torch nightly version

* update torchvision number for torch nightly dependence

* add changelog

* Update CHANGELOG.md

* Update CHANGELOG.md

05 Jan, 2022 1 commit

Enabling ssd_offload training basic tests. (#887) · c5e471bc

Paul Johnson authored Jan 05, 2022

* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments.
* Update Changelog

24 Dec, 2021 1 commit
- [skip ci] update release.md (#896) · 541bb8c9
  Anupam Bhatnagar authored Dec 23, 2021
```
* [skip ci] update release.md

* [skip ci] minor edit
```
21 Dec, 2021 5 commits

0.4.4 release · 38af6d32
Anupam Bhatnagar authored Dec 21, 2021

[skip ci] updating date in changelog (#892) · 8397f766
Anupam Bhatnagar authored Dec 21, 2021

Changelog update (#891) · 8e770bac

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] adding comments to changelog

* adding date to changelog

* [skip ci] minor edit

[Fix] - Finiteness check for all tensors (#890) · c3fc3894
Anupam Bhatnagar authored Dec 21, 2021
```
* Finiteness check for all tensors

* [skip ci] updating changelog
```

Release automation (#888) · 49eacf12

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] first commit to automate release process

* empty commit

* fix syntax

* fix next_version value

* fixing more syntax

* remove uses

* fix

* fixed path in setup.py

* trying a basic example

* adding branch

* change release to name

* adding first step

* remove push trigger

* change order in ON section

* modifying manual workflow

* adding fairscale release workflow

* removing unused workflows

* replacing values with secrets

* fixing __version__ in __init__.py

* cleanup

* restoring import statement

16 Dec, 2021 1 commit

Added warn_on_trainable_params_changed constructor parameter to allow the user... · 99163d4f

Freddy Snijder authored Dec 16, 2021

Added warn_on_trainable_params_changed constructor parameter to allow the user to suppress the warning on trainable parameters changed (#886)

* Added warn_on_trainable_params_changed constructor parameter to allow the user to suppress the warning on trainable parameters changed; the default is True and thus the default behavior is unchanged

* Added parameter documentation

13 Dec, 2021 1 commit

[feat] support eval in mevo (#884) · 56add6d5

Min Xu authored Dec 13, 2021

- During eval, we fall back to just the output projection without fusing
- added unit test to ensure the shape is correct

06 Dec, 2021 1 commit

Fix for Key Error that can happen in certain FSDP wrapping scenarios of... · e6acdcc3

Freddy Snijder authored Dec 06, 2021

Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881)

* Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876)

* Styling fixes

* Updated the test to be independent of the Huggingface transformers package

* Added test for issue #876

* Small error message fix

* Skip test when CUDA is not available

* Fixed naming of model

02 Dec, 2021 5 commits
- [fix] [FSDP] Do not lose original reshard_after_forward (#880) · 7c2c3e00
  Min Xu authored Dec 02, 2021
```
* [fix] [FSDP] Do not lose original reshard_after_forward

- In a corner case we can lose this value
- Saving it and use it in the reset function fixed it
- A trivial case probably not worth a dedicated test for now

* added changelog
```
- Update bug-report.md · 1eccb92d
  Min Xu authored Dec 02, 2021
  
- Update feature-request.md · f177f80c
  Min Xu authored Dec 02, 2021
  
- Update questions-help-support.md · 684e6aed
  Min Xu authored Dec 02, 2021
  
- Update questions-help-support.md · 451a1fe3
  Min Xu authored Dec 02, 2021
  
29 Nov, 2021 1 commit
- Add PyTorch version in README (#877) · f5c719b2
  Anupam Bhatnagar authored Nov 29, 2021
  
24 Nov, 2021 2 commits

[benchmarks]Add an MOE benchmark (#866) · 56254247

Ying Zhang authored Nov 24, 2021

* Add MOE to lm benchmarks

* linter

* Fix source / target

* address comments

* address comments

* address comments

* add circleci

* fix circleci

* precommit

[chore]Update README to specify the exact PyTorch version we are testing with. (#870) · 73187df0
anj-s authored Nov 23, 2021
```
* Update README to specify the exact PyTorch version we are testing with.

* update to 1.10.0 in the README
```

21 Nov, 2021 1 commit
- Update README.md · b724a77e
  anj-s authored Nov 21, 2021
  
19 Nov, 2021 1 commit

Add installation instructions through conda (#863) · 117fc8bd

h-vetinari authored Nov 20, 2021

* DOC: fix the rst-headers in installation instructions

* DOC: add installation through conda-forge to instructions

* DOC: fix rst-syntax in installation-instructions

* DOC: add comment about building from source with GPU-support

18 Nov, 2021 3 commits

remove no-commit-to-branch hook (#861) · 824022be
Anupam Bhatnagar authored Nov 18, 2021

[chore] 0.4.3 release (#860) · 68d10f73

Min Xu authored Nov 18, 2021



* [chore] 0.4.3 release

* update setup.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

[fix] [MEVO]: make mevo work with eval and optim_state checkpointing (#851) · 0db50ce5

Min Xu authored Nov 18, 2021



* [fix]: fix eval for shared weight FSDP

* fixing optim state saving

* add changelog

* reformat with newer local isort

* update test

* avoid computing reference state unless we are testing training

* added optim_state test

* make mypy happy

* move tests; maybe we need to put CUDA memory related tests first in the lists
Co-authored-by: Min Xu <min.xu.public@gmail.com>
