Commits · 105f65078f385da2fc9a7746912bab92957f4503 · OpenDAS / fairscale

02 Mar, 2022 1 commit

Add a new arg, "force_broadcast_object", to OSS __init__ (#942) · 105f6507

foreveronehundred authored Mar 02, 2022

* [FSDP] Add an arg for FSDP __init__

Add an arg, disable_reshard_on_root, for FSDP __init__ to handle the following issue
https://github.com/facebookresearch/fairscale/issues/878


In some cases (models wrapped by autowrap), the parameters (of root modules) need to be sharded, and reshard_after_forward should not be set to False.
"disable_reshard_on_root" lets users choose whether reshard_after_forward of root modules is forced to False.

* Update fully_sharded_data_parallel.py

Modified the description of the feature to explain it more clearly.

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Update the comments for disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* Modified the comments

Modified the comments of disable_reshard_on_root

* Add a new argument for OSS __init__

Add a new argument for OSS __init__ to force the OSS to apply "_broadcast_object" for rebuilding the sharded optimizer. For more details, please see https://github.com/facebookresearch/fairscale/issues/937



* Remove redundant space

Remove redundant space
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>


23 Feb, 2022 2 commits
- [fix][FSDP] Add support for saving optimizer state with expert replication (#936) · 40e7450f
  anj-s authored Feb 23, 2022
```
* checkpoint tests

* checkpoint tests

* fix tests

* lint fixes

* remove prints

* lint fixes

* add comments

* add changelog

* more cleanup

* lint fix
```
- fix typo (#938) · cb72ae54
  anj-s authored Feb 22, 2022
  
22 Feb, 2022 1 commit

[benchmarks] Add benchmarks for FSDP (#765) · f9a125db

anj-s authored Feb 22, 2022

* add benchmarks for fsdp

* fix lint errors

* clean up

* clean up unused flags

* add the benchmarks

* remove unused args

* fix lint errors

* fix lint errors

* update command line

* add support for multiple devices

* try full fp16 mode

* try full fp16 mode

* lint errors

* merge main

* lint errors

* lint errors

* lint error

* update intersphinx mapping for numpy

* update intersphinx mapping for numpy

* skip test

* added golden configs

* use synthetic benchmarks

* fix fn name

* fix cuda device id

* fix verify

* lint fix


15 Feb, 2022 2 commits

Update CHANGELOG.md (#935) · 9090bfdc

ruanslv authored Feb 15, 2022

* Update CHANGELOG.md

Adding https://github.com/facebookresearch/fairscale/pull/930 to changelog

* Update CHANGELOG.md
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>


[fix] Add option to wrap root module in auto_wrap (#930) · 3b8f445f

ruanslv authored Feb 15, 2022



* [fix] Add option to wrap root module in auto_wrap

* Fix unit-test comment

* adding a few more tests to make expected behavior clear

* move changes to wrap policy as suggested

* set default to false

* revert pre-commit change

* revert pre-commit change 2
Co-authored-by: Ruan Silva <ruanrms@fb.com>


14 Feb, 2022 1 commit

[chore] [cleanup]: pytest, pytorch new versions, fix tests (#933) · fae29959

Min Xu authored Feb 14, 2022



* update pytest versions

* [test] test related changes

- upgrade to newer pytorch versions
- added function to make test more deterministic on A100 and TF32
- fixed some tests so that they are correctly skipped on a single GPU system

* more fixes

* formatting overly long lines

* format

* better test without triggering a warning

* fix an optim state bug with newer pytorch

- the Adam optimizer seems to return "step" as a singleton tensor now in the nightly build
- this fixes it, assuming a non-tensor value can still be loaded back by the optimizer

* improve oss.py

- using min_loss for regression checking is a bit more reliable
- also increased the num epochs from 10 to 12

* small oss.py fix

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>
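
The `min_loss` regression check mentioned under "improve oss.py" above could look roughly like the sketch below. It is only an illustration in plain Python: the function name `passes_loss_regression`, its parameters, and the tolerance value are hypothetical and are not taken from the benchmark script. The idea is to compare the smallest loss seen across all epochs against a reference, instead of the loss of one particular epoch.

```python
from typing import Sequence


def passes_loss_regression(
    epoch_losses: Sequence[float],
    reference_min_loss: float,
    tolerance: float = 1e-3,  # hypothetical slack, not from the source
) -> bool:
    """Return True if the best (minimum) loss is no worse than the reference."""
    if not epoch_losses:
        raise ValueError("epoch_losses must contain at least one value")
    # Use the minimum over all epochs rather than a single epoch's loss.
    return min(epoch_losses) <= reference_min_loss + tolerance
```

For example, `passes_loss_regression([0.9, 0.4, 0.5], reference_min_loss=0.4)` passes even though the last epoch is slightly worse than the best one.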


11 Feb, 2022 1 commit

[minor] skipping one more flaky test (#932) · 8527c587

Min Xu authored Feb 11, 2022



* skipping one more test

* formatting

* minor fix and copyright header

* comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>


08 Feb, 2022 2 commits

[FSDP] Add an arg for FSDP __init__ (#926) · 67bf5bf8

foreveronehundred authored Feb 09, 2022

* [FSDP] Add an arg for FSDP __init__

Add an arg, disable_reshard_on_root, for FSDP __init__ to handle the following issue
https://github.com/facebookresearch/fairscale/issues/878


In some cases (models wrapped by autowrap), the parameters (of root modules) need to be sharded, and reshard_after_forward should not be set to False.
"disable_reshard_on_root" lets users choose whether reshard_after_forward of root modules is forced to False.

* Update fully_sharded_data_parallel.py

Modified the description of the feature to explain it more clearly.

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Update the comments for disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* Modified the comments

Modified the comments of disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
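
The interaction between `disable_reshard_on_root` and `reshard_after_forward` described in this commit can be sketched as follows. This is a minimal illustration in plain Python, not fairscale's code: the function name `effective_reshard_after_forward` and the `is_root` parameter are hypothetical, and the sketch assumes only what the commit message states, namely that the flag decides whether `reshard_after_forward` of root modules is forced to False.

```python
def effective_reshard_after_forward(
    is_root: bool,
    reshard_after_forward: bool,
    disable_reshard_on_root: bool,
) -> bool:
    """Hypothetical helper: the reshard setting a wrapped module ends up with."""
    if is_root and disable_reshard_on_root:
        # Forced to False for root modules, as the commit message describes.
        return False
    # Otherwise the user's own setting is respected, including on the root.
    return reshard_after_forward
```

With `disable_reshard_on_root=False`, a root module wrapped by autowrap keeps `reshard_after_forward=True`, which is the case issue #878 asks for.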


[chore] Fix docs build by updating the numpy intersphinx mapping (#929) · 7202115e

anj-s authored Feb 08, 2022

* update intersphinx mapping for numpy

* update intersphinx mapping for numpy

* update pytorch mapping and disable test


28 Jan, 2022 1 commit

[feat] add CosFace paper's LMCL to MEVO (#916) · 89e1ae5f

Min Xu authored Jan 27, 2022



* [feat] add CosFace paper's LMCL to MEVO

- added baseline algorithm to the reference kernel
- added MEVO version of LMCL
- added unit test to verify it is correct with respect to the reference as well as its memory usage

* updated changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>


25 Jan, 2022 2 commits
- [minor] make backward assert a bit better (#919) · 8ba649e1
  Min Xu authored Jan 25, 2022
```
* [minor] better assert in backward

* mypy
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
- [fix] reduce unit test memory and workaround the flakiness of the test (#917) · 5d8a505c
  Min Xu authored Jan 25, 2022
```
* [fix] reduce unit test memory

* set seed in CI

* fix random seed function

* giving up CI, //sigh
```
20 Jan, 2022 1 commit
- [FSDP] Add FairScale FSDP adoptions logging (#913) · 6f18e779
  Yanli Zhao authored Jan 20, 2022
```
* Add FairScale FSDP adoptions logging

* Add FairScale FSDP adoptions logging
```
18 Jan, 2022 1 commit
- FSDP: better traceback for dtype assertion (#912) · fef44233
  Sam Shleifer authored Jan 17, 2022
  
14 Jan, 2022 3 commits
- 0.4.5 release · 6b2f992c
  Anupam Bhatnagar authored Jan 14, 2022
  
- [Chore]release 0.4.5 (#911) · 4a3bd93a
  tmarkstrum authored Jan 14, 2022
```
* release 0.4.5

* added some content for the release

* fixed a format issue.
```
- small fixes to layerwise gradient scaler (#910) · 10d21b38
  Anupam Bhatnagar authored Jan 14, 2022
  
13 Jan, 2022 3 commits

[skip ci] fixing typos · 39e7821a
Anupam Bhatnagar authored Jan 13, 2022


[feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2

Anupam Bhatnagar authored Jan 12, 2022

* [skip ci] first commit

* [skip ci] gradient scaler example

* [skip ci] adding feed forward toy example

* [skip ci] adding types

* [skip ci] adding backward hook

* [skip ci] update

* [skip ci] working feed forward example

* [skip ci] working feed forward example

* [skip ci] use named_modules instead of named_children

* [skip ci] adding new file

* [skip ci] clean up

* [skip ci] implement unscale function

* [skip ci] implement unscale function

* [skip ci] removing old file

* [skip ci] removing some more old files

* [skip ci] making unscale function generic

* [skip ci] adding test for vision model

* [skip ci] adding identity layer

* [skip ci] cleanup files

* [skip ci] refactoring

* [skip ci] more refactoring

* [skip ci] added functionality to update scale

* [skip ci] data loader clean up

* [skip ci] implemented inf checks and update scale functions

* [skip ci] code clean up. added test with autocast. does not work atm

* adding documentation

* adding dependency in requirements-dev.txt

* updating pytorch nightly version

* updating changelog

* adding is_cuda_available to test_vision_model

* set same timeout on cpu and gpu

* reverting cpu timeout, skip vision test on cpu

* addressing comments, fixing vision test

* unscale uses in-place matmul

* some more cleanup


[Fix][FSDP]fixed padding size of input tensor for reduce scatter (#907) · fb4eca19

tmarkstrum authored Jan 12, 2022



* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* added changelog

* fixed some commit.

* added unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.

* throw an error instead of rolling back to use default process group for reduce_scatter_process_group

* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"

This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.

* added check for None to avoid unit test failure

* fixed an error to avoid the unit tests failure
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
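
The commit message does not spell out how the padding size is computed. As a rough sketch, under the assumption that the reduce-scatter input must split into equal chunks, one per rank of the group, the padded length can be derived as below. The function name `pad_for_reduce_scatter` and its parameters are hypothetical and are not part of fairscale.

```python
from typing import Tuple


def pad_for_reduce_scatter(numel: int, group_size: int) -> Tuple[int, int]:
    """Hypothetical helper: return (padded_numel, padding) for a flat tensor.

    Assumes the input is padded up to the next multiple of the size of the
    process group that performs the reduce-scatter.
    """
    if numel < 0:
        raise ValueError("numel must be non-negative")
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Number of extra elements needed to reach a multiple of group_size.
    padding = (-numel) % group_size
    return numel + padding, padding
```

The point relevant to this fix is that `group_size` should be the size of the group actually used for the reduce-scatter; using a different group's size would give a wrong padded length.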


12 Jan, 2022 1 commit

[chore] Update the CHANGELOG to add details about the new feature that enables... · 0044372c

tmarkstrum authored Jan 11, 2022

[chore] Update the CHANGELOG to add details about the new feature that enables reduce_scatter overlap in backward propagation (#906)

* updated the change log

* improve the change log


07 Jan, 2022 1 commit

[FSDP] Enable FSDP reduce scatter overlap (#897) · 0a526bcb

tmarkstrum authored Jan 07, 2022

* enable reduce scatter overlap with other operations

* fixed unit tests and added docstrings for the new parameters for fsdp

* fixed more unit tests

* fixed unit tests

* avoided the pickle error on process_group_reduce_scatter

* removed an unnecessary parameter in unit tests

* remove unnecessary prints

* fixed the docstring

* skipped the test_offload unit test because this unit test failed in the main branch

* removed the enable_reduce_scatter_overlap API parameter

* added docstring for the default value of the process_group_reduce_scatter parameter

* fixed a syntax bug

* fixed a bug which caused a unit test failure

* removed the all_gather in the ProcessGroupName enum

* added more comment

* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter


06 Jan, 2022 2 commits

fix trailing space issue (#903) · 02a8913c
tmarkstrum authored Jan 06, 2022


FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb

four4fish authored Jan 06, 2022

* FullyShardedDataParallel: only return full state dict on rank 0

* Add flag and make rank 0 only optional

* Add tests

* Add docs

* address comments

* update comments

* update torch nightly version

* update torchvision number for torch nightly dependence

* add changelog

* Update CHANGELOG.md

* Update CHANGELOG.md
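
The optional rank-0-only behavior described in this commit can be illustrated with a small sketch. It is written in plain Python and does not use fairscale: the function name `full_state_dict_for_rank`, the `rank_0_only` flag name, and the choice to hand an empty dict to the other ranks are all assumptions made for illustration.

```python
from typing import Any, Dict


def full_state_dict_for_rank(
    full_state: Dict[str, Any],
    rank: int,
    rank_0_only: bool = False,  # hypothetical flag name
) -> Dict[str, Any]:
    """Hypothetical helper: decide what a given rank gets back."""
    if rank_0_only and rank != 0:
        # Assumption: non-zero ranks receive an empty dict.
        return {}
    # Rank 0, or every rank when the flag is off, gets the full state dict.
    return dict(full_state)
```

Keeping the flag off by default preserves the previous behavior, where every rank receives the full state dict.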


05 Jan, 2022 1 commit

Enabling ssd_offload training basic tests. (#887) · c5e471bc

Paul Johnson authored Jan 05, 2022

* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments.
* Update Changelog


24 Dec, 2021 1 commit
- [skip ci] update release.md (#896) · 541bb8c9
  Anupam Bhatnagar authored Dec 23, 2021
```
* [skip ci] update release.md

* [skip ci] minor edit
```
21 Dec, 2021 5 commits

0.4.4 release · 38af6d32
Anupam Bhatnagar authored Dec 21, 2021

[skip ci] updating date in changelog (#892) · 8397f766
Anupam Bhatnagar authored Dec 21, 2021


Changelog update (#891) · 8e770bac

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] adding comments to changelog

* adding date to changelog

* [skip ci] minor edit


[Fix] - Finiteness check for all tensors (#890) · c3fc3894
Anupam Bhatnagar authored Dec 21, 2021
```
* Finiteness check for all tensors

* [skip ci] updating changelog
```

Release automation (#888) · 49eacf12

Anupam Bhatnagar authored Dec 21, 2021

* [skip ci] first commit to automate release process

* empty commit

* fix syntax

* fix next_version value

* fixing more syntax

* remove uses

* fix

* fixed path in setup.py

* trying a basic example

* adding branch

* change release to name

* adding first step

* remove push trigger

* change order in ON section

* modifying manual workflow

* adding fairscale release workflow

* removing unused workflows

* replacing values with secrets

* fixing __version__ in __init__.py

* cleanup

* restoring import statement


16 Dec, 2021 1 commit

Added warn_on_trainable_params_changed constructor parameter to allow the user... · 99163d4f

Freddy Snijder authored Dec 16, 2021

Added warn_on_trainable_params_changed constructor parameter to allow the user to suppress the warning on trainable parameters changed (#886)

* Added warn_on_trainable_params_changed constructor parameter to allow the user to suppress the warning on trainable parameters changed; the default is True and thus the default behavior is unchanged

* Added parameter documentation


13 Dec, 2021 1 commit

[feat] support eval in mevo (#884) · 56add6d5

Min Xu authored Dec 13, 2021

- During eval, we will fall back to just output projection without fusing
- added unit test to ensure the shape is correct


06 Dec, 2021 1 commit

Fix for Key Error that can happen in certain FSDP wrapping scenarios of... · e6acdcc3

Freddy Snijder authored Dec 06, 2021

Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881)

* Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876)

* Styling fixes

* Updated the test to be independent of the Huggingface transformers package

* Added test for issue #876

* Small error message fix

* Skip test when CUDA is not available

* Fixed naming of model


02 Dec, 2021 5 commits
- [fix] [FSDP] Do not lose original reshard_after_forward (#880) · 7c2c3e00
  Min Xu authored Dec 02, 2021
```
* [fix] [FSDP] Do not lose original reshard_after_forward

- In a corner case we can lose this value
- Saving it and using it in the reset function fixes it
- A trivial case probably not worth a dedicated test for now

* added changelog
```
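
The save-and-restore pattern this fix describes can be sketched as below. The commit message does not say which corner case loses the value, so the override step here is purely hypothetical, as are the class name `ReshardSetting` and the method `force_no_reshard`. Only the idea of keeping the original `reshard_after_forward` and using it in a reset function comes from the commit message.

```python
class ReshardSetting:
    """Hypothetical holder for a setting that may be overridden at runtime."""

    def __init__(self, reshard_after_forward: bool) -> None:
        self.reshard_after_forward = reshard_after_forward
        # Keep the user's original choice so an override cannot lose it.
        self._orig_reshard_after_forward = reshard_after_forward

    def force_no_reshard(self) -> None:
        # Hypothetical override that changes the live value.
        self.reshard_after_forward = False

    def reset(self) -> None:
        # Restore from the saved original, not from the overridden value.
        self.reshard_after_forward = self._orig_reshard_after_forward
```

Without the saved copy, a reset that runs after an override has no way to recover the value the user originally passed in.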
- Update bug-report.md · 1eccb92d
  Min Xu authored Dec 02, 2021
  
- Update feature-request.md · f177f80c
  Min Xu authored Dec 02, 2021
  
- Update questions-help-support.md · 684e6aed
  Min Xu authored Dec 02, 2021
  
- Update questions-help-support.md · 451a1fe3
  Min Xu authored Dec 02, 2021
  