Commits · 51b53ddb6c3aa77426c7d5cc0b543b79628053c4 · OpenDAS / fairscale

02 May, 2022 1 commit

[FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and... · 51b53ddb

Paul Johnson authored May 02, 2022

[FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974)

* [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data

* Get ssd_offload unit tests passing

* [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on

* Update changelog

51b53ddb

27 Apr, 2022 1 commit
- Fix docstring format for `AdaScaleWrapper` (#979) · 58ccb166
  Carlos Mocholí authored Apr 27, 2022
  
  58ccb166
26 Apr, 2022 1 commit
- skip failed ssd offload tests for nightly (#977) · e65833a0
  Min Xu authored Apr 25, 2022
```
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
  e65833a0
25 Apr, 2022 2 commits
- [chore] update nightly version (#976) · 8baa03b0
  Min Xu authored Apr 25, 2022
```
* [chore] update nightly version

* use yesterday's
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
  8baa03b0
- Require py 3.7+ (#975) · 4a6afa2c
  Joel Stremmel authored Apr 25, 2022
  
  4a6afa2c
06 Apr, 2022 1 commit

Improvements to ssd_offload to support pickling/unpickling SsdTensorHandle... · 92f27daa

Paul Johnson authored Apr 06, 2022

Improvements to ssd_offload to support pickling/unpickling SsdTensorHandle (and derived classes) (#964)

Verified that FSDP wrapped models using ssd_offload checkpoint save and restore correctly

92f27daa

30 Mar, 2022 1 commit

Remove sort_iseed_config and related dependencies. (#969) · 72f373c1

Paul Johnson authored Mar 30, 2022

This is no longer needed since isort's version is 5.10

Also pin black to version 22.3.0 to fix an issue with the click
dependency.

Update files that now fail with the new version of black (`a = 2 ** 4` ->
`a = 2**4`)

72f373c1
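
For context on the formatting change in #969: black 22.x removes the spaces around `**` when both operands are simple (names, numeric literals, attribute access), so files formatted by older versions fail the check. A small sketch; the variable names are only illustrative.

```python
# Old layout (black < 22): spaces around the power operator.
#   a = 2 ** 4
# New layout (black 22.x): simple operands "hug" the operator.
a = 2**4

# Only the layout differs; the value is the same either way.
b = 2 ** 4  # black 22.x would rewrite this line as `b = 2**4`
print(a == b)
```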

16 Mar, 2022 1 commit

[FSDP] Upstream fairseq big changes (#956) · 1bc96fa8

Christopher Dewan authored Mar 16, 2022



* made gradient predivide factor configurable

* fix lints
Co-authored-by: Your Name <you@example.com>

1bc96fa8
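
A minimal sketch of what a configurable gradient predivide factor might look like, assuming gradients are divided by the factor before the cross-rank sum and by `world_size / factor` afterwards. The function names and the default heuristic are assumptions for illustration, not code taken from fairscale.

```python
from typing import List, Optional


def default_predivide_factor(world_size: int) -> float:
    # Assumed heuristic: grow a power of two until it is roughly sqrt(world_size).
    factor = 1
    while world_size % factor == 0 and world_size / factor > factor:
        factor *= 2
    return float(factor)


def averaged_gradient(
    per_rank_grads: List[float], predivide_factor: Optional[float] = None
) -> float:
    """Simulate averaging one gradient value across ranks in two division steps."""
    world_size = len(per_rank_grads)
    pre = predivide_factor if predivide_factor is not None else default_predivide_factor(world_size)
    post = world_size / pre
    # Dividing before the sum keeps the intermediate value small (helps in fp16).
    reduced = sum(g / pre for g in per_rank_grads)
    # Dividing after the sum completes the average.
    return reduced / post
```

Whatever factor is chosen, the result is still the mean over ranks; the factor only changes how the division is split around the reduction.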

09 Mar, 2022 2 commits
- 0.4.6 release · 3c24beb9
  Anupam Bhatnagar authored Mar 09, 2022
  
  3c24beb9
- [chore] 0.4.6 release (#953) · 3e36cd07
  tmarkstrum authored Mar 09, 2022
```
* [chore] 0.4.6 release

* added the third party libs removed by precommit
```
  3e36cd07
08 Mar, 2022 1 commit

[chore] Fix copyright headers & fixed issue with mypy & NumPy versions in pre-commit (#951) · 8fa26ae4

Min Xu authored Mar 08, 2022



* copyright headers

* isort and pyproject.toml

* precommit and requirement for isort-seed-config

* mypy

* dummy change

* numpy version for pre-commit

* fix mypy issue caused by numpy
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8fa26ae4

05 Mar, 2022 1 commit

docs: add GH button in support of Ukraine (#949) · 2877474c

Dmitry Vinnik authored Mar 04, 2022

* Adding ELI5 video to Fairscale

* docs: add GH button in support of Ukraine

## Summary:
Our mission at Meta Open Source is to empower communities through open source, and we believe that means building a welcoming and safe environment for all. As a part of this work, we are adding this banner in support of Ukraine during this crisis.

2877474c

04 Mar, 2022 1 commit
- Update README.md (#946) · a444eeec
  Vittorio Caggiano authored Mar 04, 2022
  
  a444eeec
03 Mar, 2022 1 commit

[fix] FSDP: EMA related fixes (#922) · 9f347f37

Min Xu authored Mar 03, 2022



* add an ignore file

* [fix] FSDP: handle the lazy_init better

- when state_dict and load_state_dict are called, do not let them change
  the lazy_init state.

* changelog

* longer timeout

* Revert "longer timeout"

This reverts commit 00cc145fe86210a0972a1e7ba4f37531b9e091eb.

* testing

* adding the failed test

* fix the global to local id

* formatting

* more complete fix and test

* minor fix for an assert

* update changelog

* remove an extra line

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* Update fairscale/nn/data_parallel/fsdp_optim_utils.py
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* addressed review comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

9f347f37

02 Mar, 2022 2 commits

Adding ELI5 video to Fairscale (#939) · 2ca4f0ee
Dmitry Vinnik authored Mar 02, 2022

2ca4f0ee

Add a new arg, "force_broadcast_object", to OSS __init__ (#942) · 105f6507

foreveronehundred authored Mar 02, 2022

* [FSDP] Add an arg for FSDP __init__

Add an arg, disable_reshard_on_root, for FSDP __init__ to handle the following issue
https://github.com/facebookresearch/fairscale/issues/878


In some cases (models wrapped by autowrap), the parameters of root modules need to be sharded, and reshard_after_forward should not be set to False.
"disable_reshard_on_root" is for users to choose whether to force reshard_after_forward of root modules to be False or not.

* Update fully_sharded_data_parallel.py

Modified the description of the feature to explain it more clearly.

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Update the comments for disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* Modified the comments

Modified the comments of disable_reshard_on_root

* Add a new argument for OSS __init__

Add a new argument for OSS __init__ to force the OSS to apply "_broadcast_object" for rebuilding the sharded optimizer. For more details, please see https://github.com/facebookresearch/fairscale/issues/937



* Remove redundant space

Remove redundant space
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

105f6507

23 Feb, 2022 2 commits
- [fix][FSDP] Add support for saving optimizer state with expert replication (#936) · 40e7450f
  anj-s authored Feb 23, 2022
```
* checkpoint tests

* checkpoint tests

* fix tests

* lint fixes

* remove prints

* lint fixes

* add comments

* add changelog

* more cleanup

* lint fix
```
  40e7450f
- fix typo (#938) · cb72ae54
  anj-s authored Feb 22, 2022
  
  cb72ae54
22 Feb, 2022 1 commit

[benchmarks] Add benchmarks for FSDP (#765) · f9a125db

anj-s authored Feb 22, 2022

* add benchmarks for fsdp

* fix lint errors

* clean up

* clean up unused flags

* add the benchmarks

* remove unused args

* fix lint errors

* fix lint errors

* update command line

* add support for multiple devices

* try full fp16 mode

* try full fp16 mode

* lint errors

* merge main

* lint errors

* lint errors

* lint error

* update intersphinx mapping for numpy

* update intersphinx mapping for numpy

* skip test

* added golden configs

* use synthetic benchmarks

* fix fn name

* fix cuda device id

* fix verify

* lint fix

f9a125db

15 Feb, 2022 2 commits

Update CHANGELOG.md (#935) · 9090bfdc

ruanslv authored Feb 15, 2022

* Update CHANGELOG.md

Adding https://github.com/facebookresearch/fairscale/pull/930 to changelog

* Update CHANGELOG.md
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

9090bfdc

[fix] Add option to wrap root module in auto_wrap (#930) · 3b8f445f

ruanslv authored Feb 15, 2022



* [fix] Add option to wrap root module in auto_wrap

* Fix unit-test comment

* adding a few more tests to make expected behavior clear

* move changes to wrap policy as suggested

* set default to false

* revert pre-commit change

* revert pre-commit change 2
Co-authored-by: Ruan Silva <ruanrms@fb.com>

3b8f445f

14 Feb, 2022 1 commit

[chore] [cleanup]: pytest, pytorch new versions, fix tests (#933) · fae29959

Min Xu authored Feb 14, 2022



* update pytest versions

* [test] test related changes

- upgrade to newer pytorch versions
- added function to make test more deterministic on A100 and TF32
- fixed some tests so that they are correctly skipped on a single GPU system

* more fixes

* formatting overly long lines

* format

* better test without triggering a warning

* fix an optim state bug with newer pytorch

- the adam optimizer seems to return "step" as a singleton tensor now in the
nightly build
- this fixes it, assuming a non-tensor value can still be loaded back by
the optimizer

* improve oss.py

- use min_loss for regression checking is a bit more reliable
- also increased the num epochs from 10 to 12

* small oss.py fix

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

fae29959
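
The optimizer-state fix in #933 can be pictured roughly as follows: if `step` arrives as a one-element tensor, convert it to a plain Python number before saving. This is a sketch under assumptions; `FakeSingletonTensor` and `normalize_step` are hypothetical names, and the stand-in class only mimics the `.item()` method that real tensors provide.

```python
from typing import Any, Dict


class FakeSingletonTensor:
    """Hypothetical stand-in for a one-element tensor; only mimics `.item()`."""

    def __init__(self, value: float) -> None:
        self._value = value

    def item(self) -> float:
        return self._value


def normalize_step(param_state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the state where a tensor-like `step` becomes a plain number."""
    out = dict(param_state)
    step = out.get("step")
    if hasattr(step, "item"):
        # Tensor-like value: unwrap it to a Python scalar.
        out["step"] = step.item()
    return out
```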

11 Feb, 2022 1 commit

[minor] skipping one more flaky test (#932) · 8527c587

Min Xu authored Feb 11, 2022



* skipping one more test

* formatting

* minor fix and copyright header

* comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8527c587

08 Feb, 2022 2 commits

[FSDP] Add an arg for FSDP __init__ (#926) · 67bf5bf8

foreveronehundred authored Feb 09, 2022

* [FSDP] Add an arg for FSDP __init__

Add an arg, disable_reshard_on_root, for FSDP __init__ to handle the following issue
https://github.com/facebookresearch/fairscale/issues/878


In some cases (models wrapped by autowrap), the parameters of root modules need to be sharded, and reshard_after_forward should not be set to False.
"disable_reshard_on_root" is for users to choose whether to force reshard_after_forward of root modules to be False or not.

* Update fully_sharded_data_parallel.py

Modified the description of the feature to explain it more clearly.

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Update the comments for disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* Modified the comments

Modified the comments of disable_reshard_on_root
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

67bf5bf8
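
The behaviour described for `disable_reshard_on_root` in #926 can be summarised as a small decision function. This is only a sketch of the logic as stated in the commit message; the helper name and the default value of `True` are assumptions, not fairscale's implementation.

```python
def effective_reshard_after_forward(
    is_root: bool,
    reshard_after_forward: bool,
    disable_reshard_on_root: bool = True,  # assumed default
) -> bool:
    """Decide whether a wrapped module reshards its parameters after forward."""
    if is_root and disable_reshard_on_root:
        # Root module: resharding is forced off.
        return False
    # Otherwise the user's setting is respected, including on the root.
    return reshard_after_forward
```

With `disable_reshard_on_root=False`, a root module wrapped by autowrap keeps whatever `reshard_after_forward` value the user passed.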

[chore] Fix docs build by updating the numpy intersphinx mapping (#929) · 7202115e

anj-s authored Feb 08, 2022

* update intersphinx mapping for numpy

* update intersphinx mapping for numpy

* update pytorch mapping and disable test

7202115e

28 Jan, 2022 1 commit

[feat] add CosFace paper's LMCL to MEVO (#916) · 89e1ae5f

Min Xu authored Jan 27, 2022



* [feat] add CosFace paper's LMCL to MEVO

- added baseline algorithm to the reference kernel
- added MEVO version of LMCL
- added unit test to verify it is correct with respect to the reference as well as its memory usage

* updated changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>

89e1ae5f
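
For readers unfamiliar with LMCL (the Large Margin Cosine Loss from the CosFace paper), here is a plain-Python sketch for a single sample: cosine similarities to each class weight are computed, a margin is subtracted from the target class only, everything is scaled, and softmax cross-entropy is applied. The function name and the default `scale` and `margin` values are illustrative assumptions; this is not the MEVO kernel.

```python
import math
from typing import List


def _normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def lmcl_loss(
    feature: List[float],
    class_weights: List[List[float]],
    target: int,
    scale: float = 30.0,
    margin: float = 0.35,
) -> float:
    """Large Margin Cosine Loss for one sample."""
    f = _normalize(feature)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(f, _normalize(w)))
        if j == target:
            cos -= margin  # the margin is applied to the target class only
        logits.append(scale * cos)
    # Numerically stable log-sum-exp, then cross-entropy for the target class.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

A positive margin always increases the loss for the same inputs, which is what pushes the classes further apart during training.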

25 Jan, 2022 2 commits
- [minor] make backward assert a bit better (#919) · 8ba649e1
  Min Xu authored Jan 25, 2022
```
* [minor] better assert in backward

* mypy
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
  8ba649e1
- [fix] reduce unit test memory and workaround the flakiness of the test (#917) · 5d8a505c
  Min Xu authored Jan 25, 2022
```
* [fix] reduce unit test memory

* set seed in CI

* fix random seed function

* giving up CI, //sigh
```
  5d8a505c
20 Jan, 2022 1 commit
- [FSDP] Add FairScale FSDP adoptions logging (#913) · 6f18e779
  Yanli Zhao authored Jan 20, 2022
```
* Add FairScale FSDP adoptions logging

* Add FairScale FSDP adoptions logging
```
  6f18e779
18 Jan, 2022 1 commit
- FSDP: better traceback for dtype assertion (#912) · fef44233
  Sam Shleifer authored Jan 17, 2022
  
  fef44233
14 Jan, 2022 3 commits
- 0.4.5 release · 6b2f992c
  Anupam Bhatnagar authored Jan 14, 2022
  
  6b2f992c
- [Chore]release 0.4.5 (#911) · 4a3bd93a
  tmarkstrum authored Jan 14, 2022
```
* release 0.4.5

* added some content for the release

* fixed a format issue.
```
  4a3bd93a
- small fixes to layerwise gradient scaler (#910) · 10d21b38
  Anupam Bhatnagar authored Jan 14, 2022
  
  10d21b38
13 Jan, 2022 3 commits

[skip ci] fixing typos · 39e7821a
Anupam Bhatnagar authored Jan 13, 2022

39e7821a

[feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2

Anupam Bhatnagar authored Jan 12, 2022

* [skip ci] first commit

* [skip ci] gradient scaler example

* [skip ci] adding feed forward toy example

* [skip ci] adding types

* [skip ci] adding backward hook

* [skip ci] update

* [skip ci] working feed forward example

* [skip ci] working feed forward example

* [skip ci] use named_modules instead of named_children

* [skip ci] adding new file

* [skip ci] clean up

* [skip ci] implement unscale function

* [skip ci] implement unscale function

* [skip ci] removing old file

* [skip ci] removing some more old files

* [skip ci] making unscale function generic

* [skip ci] adding test for vision model

* [skip ci] adding identity layer

* [skip ci] cleanup files

* [skip ci] refactoring

* [skip ci] more refactoring

* [skip ci] added functionality to update scale

* [skip ci] data loader clean up

* [skip ci] implemented inf checks and update scale functions

* [skip ci] code clean up. added test with autocast. does not work atm

* adding documentation

* adding dependency in requirements-dev.txt

* updating pytorch nightly version

* updating changelog

* adding is_cuda_available to test_vision_model

* set same timeout on cpu and gpu

* reverting cpu timeout, skip vision test on cpu

* addressing comments, fixing vision test

* unscale uses in-place matmul

* some more cleanup

52d066a2
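
The commit list for #879 mentions an unscale function, inf checks, and scale updates. The sketch below shows one generic way such pieces can fit together with a separate scale per layer. All names and the growth and backoff values are assumptions, and it is not the fairscale implementation; a real scaler would usually grow a scale only after a run of overflow-free steps.

```python
import math
from typing import Dict, List


def has_inf_or_nan(grads: List[float]) -> bool:
    return any(math.isinf(g) or math.isnan(g) for g in grads)


def unscale(grads: List[float], scale: float) -> List[float]:
    """Undo the scaling applied to a layer's gradients."""
    return [g / scale for g in grads]


def update_scales(
    scales: Dict[str, float],
    layer_grads: Dict[str, List[float]],
    growth: float = 2.0,
    backoff: float = 0.5,
) -> Dict[str, float]:
    """Shrink the scale of layers that overflowed; grow the others."""
    new_scales = {}
    for name, scale in scales.items():
        if has_inf_or_nan(layer_grads[name]):
            new_scales[name] = scale * backoff
        else:
            new_scales[name] = scale * growth
    return new_scales
```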

[Fix][FSDP]fixed padding size of input tensor for reduce scatter (#907) · fb4eca19

tmarkstrum authored Jan 12, 2022



* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned wrong group

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

* added changelog

* fixed some commit.

* added unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.

* throw an error instead of rolling back to use default process group for reduce_scatter_process_group

* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"

This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.

* added check for None to avoid unit test failure

* fixed an error to avoid the unit tests failure
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

fb4eca19
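
To illustrate why the padding size matters in #907: a reduce-scatter splits the input into equally sized chunks, one per rank in the group, so the flattened input has to be padded to a multiple of that group's size. A minimal sketch with hypothetical helper names, using plain lists in place of tensors:

```python
from typing import List


def padded_numel(numel: int, group_size: int) -> int:
    """Smallest multiple of group_size that is >= numel."""
    chunk = -(-numel // group_size)  # ceiling division
    return chunk * group_size


def pad_for_reduce_scatter(flat: List[float], group_size: int) -> List[float]:
    """Right-pad with zeros so the input splits evenly across the group."""
    target = padded_numel(len(flat), group_size)
    return list(flat) + [0.0] * (target - len(flat))
```

Here `group_size` should be the size of the process group that actually performs the reduce-scatter, which is presumably why the commit also checks that this group has the expected size.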

12 Jan, 2022 1 commit

[chore] Update the CHANGELOG to add details about the new feature that enables... · 0044372c

tmarkstrum authored Jan 11, 2022

[chore] Update the CHANGELOG to add details about the new feature that enables reduce_scatter overlap in backward propagation (#906)

* updated the change log

* improve the change log

0044372c

07 Jan, 2022 1 commit

[FSDP] Enable FSDP reduce scatter overlap (#897) · 0a526bcb

tmarkstrum authored Jan 07, 2022

* enable reduce scatter overlap with other operations

* fixed unit tests and added docstrings for the new parameters for fsdp

* fixed more unit tests

* fixed unit tests

* avoided the pickle error on process_group_reduce_scatter

* removed an unnecessary parameter in unit tests

* remove unnecessary prints

* fixed the docstring

* skipped the test_offload unit test because this unit test failed in the main branch

* removed the enable_reduce_scatter_overlap API parameter

* added docstring for the default value of the process_group_reduce_scatter parameter

* fixed a syntax bug

* fixed a bug which caused unit test failures

* removed the all_gather in the ProcessGroupName enum

* added more comment

* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter

0a526bcb
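
One way to read the last few items of #897: the `process_group_reduce_scatter` parameter defaults to an enum member rather than a live process group, and the real group is created when the name is resolved. The sketch below is an assumption-laden illustration of that pattern. The `default` member, `resolve_process_group`, and `fake_create` are hypothetical, and a string stands in for a real process group.

```python
from enum import Enum
from typing import Callable, List, Union


class ProcessGroupName(str, Enum):
    default = "default"  # assumption: this member is not named in the commit message
    reduce_scatter = "reduce_scatter"


def resolve_process_group(
    group: Union[ProcessGroupName, object],
    create: Callable[[ProcessGroupName], object],
) -> object:
    """Turn a symbolic group name into a group; pass explicit groups through."""
    if isinstance(group, ProcessGroupName):
        return create(group)
    return group


created: List[ProcessGroupName] = []


def fake_create(name: ProcessGroupName) -> str:
    # Stand-in for creating (or looking up) a real process group.
    created.append(name)
    return f"group:{name.value}"
```

A symbolic default like this is plain data, which may be related to the pickle error the commit mentions avoiding.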

06 Jan, 2022 2 commits

fix trailing space issue (#903) · 02a8913c
tmarkstrum authored Jan 06, 2022

02a8913c

FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb

four4fish authored Jan 06, 2022

* FullyShardedDataParallel: only return full state dict on rank 0

* Add flag and make rank 0 only optional

* Add tests

* Add docs

* address comments

* update comments

* update torch nightly version

* update torchvision number for torch nightly dependence

* add changelog

* Update CHANGELOG.md

* Update CHANGELOG.md

d3417ceb
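
The idea behind #885, sketched in plain Python: when the optional flag is set, only rank 0 receives the full state dict, which avoids holding a full copy on every rank. The function and flag names are hypothetical, and returning an empty dict on the other ranks is an assumption rather than something the commit message states.

```python
from typing import Any, Dict


def state_dict_for_rank(
    full_state: Dict[str, Any], rank: int, rank_0_only: bool = False
) -> Dict[str, Any]:
    """Return the full state dict, or nothing on non-zero ranks if rank_0_only is set."""
    if rank_0_only and rank != 0:
        # Assumed behaviour: other ranks get an empty dict.
        return {}
    return dict(full_state)
```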