- 23 Feb, 2022 2 commits
- 22 Feb, 2022 1 commit

anj-s authored
* add benchmarks for fsdp
* fix lint errors
* clean up
* clean up unused flags
* add the benchmarks
* remove unused args
* fix lint errors
* fix lint errors
* update command line
* add support for multiple devices
* try full fp16 mode
* try full fp16 mode
* lint errors
* merge main
* lint errors
* lint errors
* lint error
* update intersphinx mapping for numpy
* update intersphinx mapping for numpy
* skip test
* added golden configs
* use synthetic benchmarks
* fix fn name
* fix cuda device id
* fix verify
* lint fix

- 15 Feb, 2022 1 commit

ruanslv authored
* [fix] Add option to wrap root module in auto_wrap
* Fix unit-test comment
* adding a few more tests to make expected behavior clear
* move changes to wrap policy as suggested
* set default to false
* revert pre-commit change
* revert pre-commit change 2
Co-authored-by: Ruan Silva <ruanrms@fb.com>

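For context, fairscale's `auto_wrap` recursively wraps eligible child modules but has historically left the root module to the caller. Below is a minimal sketch of that pre-existing idiom, assuming fairscale's `enable_wrap`/`auto_wrap` helpers and an already-initialized process group; the exact name of the new root-wrapping option is not shown in this log, so the root is wrapped manually here.

```python
# Illustrative sketch only: assumes torch.distributed is already initialized.
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, enable_wrap

def shard_model(model: nn.Module) -> nn.Module:
    # auto_wrap wraps eligible children with FSDP inside this context...
    with enable_wrap(wrapper_cls=FSDP):
        model = auto_wrap(model)
    # ...but leaves the root unwrapped by default, so callers wrap it themselves.
    # The commit above adds an (off-by-default) option to let auto_wrap do this step.
    return FSDP(model)
```
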
- 14 Feb, 2022 1 commit

Min Xu authored
* update pytest versions
* [test] test related changes
  - upgrade to newer pytorch versions
  - added function to make tests more deterministic on A100 and TF32
  - fixed some tests so that they are correctly skipped on a single GPU system
* more fixes
* formatting overly long lines
* format
* better test without triggering a warning
* fix an optim state bug with newer pytorch
  - adam optimizer seems to return "step" as a singleton tensor now in the nightly build
  - this fixes it, assuming a non-tensor value can still be loaded back by the optimizer
* improve oss.py
  - using min_loss for regression checking is a bit more reliable
  - also increased the num epochs from 10 to 12
* small oss.py fix
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

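The optim state detail above is worth illustrating: newer PyTorch builds return Adam's `step` state as a singleton tensor, while older checkpoints store a plain number. A hedged sketch of normalizing such a state dict before reloading (not the exact fairscale fix) could look like this:

```python
import torch

def normalize_adam_step(optim_state: dict) -> dict:
    """Convert tensor-valued 'step' entries back to plain Python numbers so the
    state dict can be consumed by code that expects scalar step counts."""
    for param_state in optim_state.get("state", {}).values():
        step = param_state.get("step")
        if isinstance(step, torch.Tensor):
            param_state["step"] = step.item()
    return optim_state

# Usage example; the model and optimizer here are just placeholders.
model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(1, 4)).sum().backward()
opt.step()
state = normalize_adam_step(opt.state_dict())
opt.load_state_dict(state)
```
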
- 11 Feb, 2022 1 commit

Min Xu authored
* skipping one more test
* formatting
* minor fix and copyright header
* comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>

- 08 Feb, 2022 1 commit

anj-s authored
* update intersphinx mapping for numpy
* update intersphinx mapping for numpy
* update pytorch mapping and disable test

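The intersphinx mapping lives in the Sphinx `conf.py`. A minimal sketch of pointing it at the hosted numpy and pytorch inventories is below; the exact URLs used by the project are not shown in this log.

```python
# docs/conf.py (sketch): resolve cross-references against external documentation.
intersphinx_mapping = {
    "numpy": ("https://numpy.org/doc/stable/", None),
    "torch": ("https://pytorch.org/docs/stable/", None),
}
```
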
- 28 Jan, 2022 1 commit

Min Xu authored
* [feat] add CosFace paper's LMCL to MEVO
  - added baseline algorithm to the reference kernel
  - added MEVO version of LMCL
  - added unit test to verify it is correct with respect to the reference, as well as its memory usage
* updated changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>

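For reference, LMCL (the Large Margin Cosine Loss from the CosFace paper) subtracts a margin from the target-class cosine before scaling and applying softmax cross-entropy. The sketch below is a plain PyTorch baseline in the spirit of the reference algorithm mentioned above, not the fused MEVO kernel; the names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def lmcl_loss(features: torch.Tensor, weight: torch.Tensor, targets: torch.Tensor,
              scale: float = 30.0, margin: float = 0.35) -> torch.Tensor:
    """Baseline LMCL: L2-normalize features and class weights, subtract the margin
    from the target-class cosine, then scale and apply cross-entropy."""
    cos = F.normalize(features, dim=1) @ F.normalize(weight, dim=1).t()  # (N, C)
    one_hot = F.one_hot(targets, num_classes=cos.size(1)).to(cos.dtype)
    return F.cross_entropy(scale * (cos - margin * one_hot), targets)

# Tiny usage example with random data.
feats = torch.randn(8, 16)
cls_weight = torch.randn(100, 16)      # one row per class
labels = torch.randint(0, 100, (8,))
print(lmcl_loss(feats, cls_weight, labels))
```
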
- 25 Jan, 2022 1 commit

Min Xu authored
* [fix] reduce unit test memory
* set seed in CI
* fix random seed function
* giving up CI, //sigh

- 14 Jan, 2022 1 commit

Anupam Bhatnagar authored

- 13 Jan, 2022 2 commits

Anupam Bhatnagar authored
* [skip ci] first commit
* [skip ci] gradient scaler example
* [skip ci] adding feed forward toy example
* [skip ci] adding types
* [skip ci] adding backward hook
* [skip ci] update
* [skip ci] working feed forward example
* [skip ci] working feed forward example
* [skip ci] use named_modules instead of named_children
* [skip ci] adding new file
* [skip ci] clean up
* [skip ci] implement unscale function
* [skip ci] implement unscale function
* [skip ci] removing old file
* [skip ci] removing some more old files
* [skip ci] making unscale function generic
* [skip ci] adding test for vision model
* [skip ci] adding identity layer
* [skip ci] cleanup files
* [skip ci] refactoring
* [skip ci] more refactoring
* [skip ci] added functionality to update scale
* [skip ci] data loader clean up
* [skip ci] implemented inf checks and update scale functions
* [skip ci] code clean up. added...

tmarkstrum authored
* fixed padding size of input tensor for reduce scatter, and fixed an error that assigned the wrong group
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
  Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
* added changelog
* fixed some commits
* added unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size
* throw an error instead of rolling back to use the default process group for reduce_scatter_process_group
* Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group" (this reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9)
* added check for None to avoid unit test failure
* fixed an error to avoid the unit test failures
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>

- 07 Jan, 2022 1 commit

tmarkstrum authored
* enable reduce scatter overlap with other operations
* fixed unit tests and added docstrings for the new fsdp parameters
* fixed more unit tests
* fixed unit tests
* avoided the pickle error on process_group_reduce_scatter
* removed an unnecessary parameter in unit tests
* remove unnecessary prints
* fixed the docstring
* skipped the test_offload unit test because it fails on the main branch
* removed the enable_reduce_scatter_overlap API parameter
* added a docstring for the default value of the process_group_reduce_scatter parameter
* fixed a syntax bug
* fixed a bug which caused a unit test failure
* removed the all_gather entry from the ProcessGroupName enum
* added more comments
* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter

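The `process_group_reduce_scatter` parameter mentioned above lets FSDP issue its reduce-scatter on a separate process group so it can overlap with other collectives. Below is a hedged sketch of passing a dedicated group explicitly, assuming an already-initialized distributed job; per the log, the default is the internal `ProcessGroupName.reduce_scatter`.

```python
# Sketch only: meant to run inside an initialized torch.distributed job.
import torch.distributed as dist
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

def wrap_with_dedicated_reduce_scatter_group(model: nn.Module) -> FSDP:
    ranks = list(range(dist.get_world_size()))
    reduce_scatter_pg = dist.new_group(ranks=ranks)  # dedicated group for reduce-scatter
    # Same ranks as the default group; per the 13 Jan commit above, FSDP checks the
    # sizes match and falls back to the default group when they do not.
    return FSDP(model, process_group_reduce_scatter=reduce_scatter_pg)
```
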
- 06 Jan, 2022 1 commit

four4fish authored
* FullyShardedDataParallel: only return full state dict on rank 0
* Add flag and make rank 0 only optional
* Add tests
* Add docs
* address comments
* update comments
* update torch nightly version
* update torchvision number for torch nightly dependency
* add changelog
* Update CHANGELOG.md
* Update CHANGELOG.md

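The flag added above makes gathering the full (unsharded) state dict on every rank optional. The sketch below shows the intended usage pattern; the flag name in the comment is an assumption based on the commit title, not a verified API, and the code is meant to run inside an initialized distributed job.

```python
# Sketch: save a full checkpoint from rank 0 only.
import torch
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP

# fsdp_model = FSDP(module, state_dict_on_rank_0_only=True)  # flag name assumed from the commit title

def save_full_checkpoint(fsdp_model: FSDP, path: str) -> None:
    # With rank-0-only behaviour enabled, non-zero ranks avoid materializing the
    # full parameter set in memory and only rank 0 writes the checkpoint.
    state = fsdp_model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
```
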
- 05 Jan, 2022 1 commit

Paul Johnson authored
* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments
* Update Changelog

- 13 Dec, 2021 1 commit

Min Xu authored
- During eval, we fall back to just the output projection without fusing
- added unit test to ensure the shape is correct

- 06 Dec, 2021 1 commit

Freddy Snijder authored
Fix for KeyError that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881)
* Fix for KeyError that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876)
* Styling fixes
* Updated the test to be independent of the Huggingface transformers package
* Added test for issue #876
* Small error message fix
* Skip test when CUDA is not available
* Fixed naming of model

- 18 Nov, 2021 1 commit

Min Xu authored
* [fix]: fix eval for shared weight FSDP
* fixing optim state saving
* add changelog
* reformat with newer local isort
* update test
* avoid computing reference state unless we are testing training
* added optim_state test
* make mypy happy
* move tests; maybe we need to put CUDA memory related tests first in the list
Co-authored-by: Min Xu <min.xu.public@gmail.com>

- 17 Nov, 2021 1 commit

anj-s authored
* fixed lint issues
* remove unused print statements
* add changelog entry
* [skip ci] fix lint errors

- 15 Nov, 2021 1 commit

Anupam Bhatnagar authored
* first commit
* sharded scaler hitting nan assertions
* adding test for sharded grad scaler without cpu offload
* ddp grad scaler and fsdp sharded grad scaler test failing
* removing test_output
* fix no cpu offload test
* changing optimizer from OSS to SGD
* all tests passing, code cleanup pending
* code cleanup
* fix pyproject.toml
* removing .isort.cfg
* running isort linter
* resolving isort issues
* resolving black linter issue
* resolving mypy issues
* fix import statement
* fix mypy error
* modifying import statement
* adding pytorch version requirement
* fixing pytest skip test decorator
* apply version guard for ShardedGradScaler
* removing test_fsdp_grad_scaler
* increasing num_epochs for ShardedGradScaler so that updates are not skipped
* adding support for torch 1.8
* minor edit
* [skip ci] more torch 1.8 changes
* parametrizing the tests
* cleanup code with linters
* [skip ci] update doc string
* [skip ci] addressing some more comments

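The `ShardedGradScaler` referenced above is intended as a drop-in replacement for `torch.cuda.amp.GradScaler` that handles inf/NaN checks across shards. A minimal usage sketch follows; the model, optimizer, loader, and loss function names are placeholders, and a CUDA setup with sharded data parallelism is assumed.

```python
import torch
from fairscale.optim.grad_scaler import ShardedGradScaler

def train_one_epoch(model, optimizer, loader, loss_fn):
    scaler = ShardedGradScaler()          # used like torch.cuda.amp.GradScaler
    for inputs, targets in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(inputs), targets)
        scaler.scale(loss).backward()     # scale the loss, then backprop
        scaler.step(optimizer)            # unscale + step, skipping on inf/NaN
        scaler.update()                   # adjust the scale for the next iteration
```
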
- 12 Nov, 2021 1 commit

Anupam Bhatnagar authored
* adding pre-commit files
* applying pre-commit to all files
* adding no-strict-optional argument to mypy in circle ci config
* fix typo
* updating python versions
* [skip ci] remove extra args
* adding python 3.9
* [skip ci] set pre-commit version in requirements-dev.txt
* set CACHE_VERSION
* move linters from circleci to github actions
* update python version
* update python version in benchmarks_2
* moving to python 3.9.7

- 09 Nov, 2021 1 commit

Anupam Bhatnagar authored
* CI config changes
* changing params for failing tests
* [skip ci] minor edit

- 08 Nov, 2021 2 commits

anj-s authored
* update release notes
* initial commit
* lint cleanup etc.
* helper functions; lint errors
* lint errors
* lint errors
* add back the boolean for named_parameters
* address comments and fix lint
* remove unused functions and class
* remove unused state

Benjamin Lefaudeux authored
Add SlowMo Distributed Data Parallel for clusters with slow interconnects
Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>

- 05 Nov, 2021 1 commit

Min Xu authored
* [feat] MEVO kernel
  - initial import from min/softmax and min/testing branches
  - need to rename and further cleanup
* only test with newer pytorch
* renamed and added comments and code cleanup
* rename and reduce test memory
* testing
* minor fixing
* fixing
* more fix
* changelog
* more 1.7 and 1.8 paper cuts
* remove dead code
* addressed Benjamin's comments
* addressed more comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>

- 01 Nov, 2021 2 commits

Min Xu authored
* added a new test, passing without shared weights
* tested weight sharing
* added the test to test list file
* extended to world_size = 2
* fixed test
* [feat]: add limited and experimental support for shared parameters
* fixed tests
* simplify to work with layers with at least 1 non-shared param, and add code to pick up the linked_param field for sharding the shared param
* fixed the case where the linked param is not in a separate FSDP
* changelog and remove old code
Co-authored-by: Min Xu <min.xu.public@gmail.com>

anj-s authored
* add doc strings
* add lower level SSD APIs and tests
* add the test to the list to be run
* remove unused imports
* more doc string changes
* fix lint errors

- 28 Oct, 2021 1 commit

Min Xu authored
* [fix] fix test on main
* [fix] fix test on main
Co-authored-by: Min Xu <min.xu.public@gmail.com>

- 27 Oct, 2021 4 commits

anj-s authored
* remove offload dependency on fp16
* update python version for cpu tests
* run CPU tests with updated PyTorch version
* split changes
* revert tests config
* fix lint errors
* update nightly and test PyTorch versions
* skip failing multiprocess pipe test
* always skip test
* always skip test
* always skip test
* lint error
* skip unsupported versions
* improve skip message
* lint errors
* modify docs
* add tests
* fix test failures
* modify comments
* fix lint errors
* fix lint errors

Min Xu authored
* checkpoint + nonflat + mixed_precision
* make tests pass with expected errors
* addressed comments
* add a comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>

Min Xu authored
* added the failing test
* fixed the bug
* fine-tune the condition
* typo
* typo
* changelog and added test to test files
Co-authored-by: Min Xu <min.xu.public@gmail.com>

Eugen Hotaj authored
Fixes #827.
Co-authored-by: Eugen Hotaj <ehotaj@fb.com>

- 22 Oct, 2021 1 commit

Eugen Hotaj authored
auto_shard.py currently uses torch.fx to create a symbolic DAG of operations and linearizes that DAG into an nn.Sequential so it can later be used for model offloading. This works in most cases but runs into issues for certain eager mode features, such as dynamic conditionals, shape-dependent computation, etc. This PR extends auto_shard.py to first run a preprocessing step which wraps any nn.Module which cannot be traced through. It adds a test for dynamic conditionals and updates existing failing test code. There are some immediate extensions to this approach which are marked as TODO in the code.

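The wrapping idea described above can be illustrated with stock `torch.fx`: a submodule whose forward has a data-dependent conditional cannot be traced through, but a custom tracer can treat it as an opaque leaf so the rest of the model still becomes a graph. This is a standalone sketch of the technique, not the actual `auto_shard.py` code.

```python
import torch
import torch.fx
import torch.nn as nn

class DynamicBlock(nn.Module):
    """Forward contains a data-dependent branch, which torch.fx cannot trace through."""
    def forward(self, x):
        return x.relu() if x.sum() > 0 else x

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.dynamic = DynamicBlock()

    def forward(self, x):
        return self.dynamic(self.linear(x))

class LeafWrappingTracer(torch.fx.Tracer):
    """Treat modules we cannot trace through as opaque leaf calls in the graph."""
    def is_leaf_module(self, m: nn.Module, qualified_name: str) -> bool:
        if isinstance(m, DynamicBlock):
            return True
        return super().is_leaf_module(m, qualified_name)

model = Model()
graph = LeafWrappingTracer().trace(model)       # plain symbolic_trace would fail here
gm = torch.fx.GraphModule(model, graph)
print(gm(torch.randn(2, 4)).shape)              # the leaf module runs eagerly at call time
```
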
- 21 Oct, 2021 1 commit

anj-s authored
* update python version for cpu tests
* run CPU tests with updated PyTorch version
* update nightly and test PyTorch versions
* skip failing multiprocess pipe test
* always skip test
* always skip test
* always skip test
* lint error
* skip unsupported versions
* improve skip message
* lint errors

- 20 Oct, 2021 1 commit

Quentin Duval authored
* [feat] layer memory tracking
* [feat] layer memory tracking (add tests in CI)
* [feat] layer memory tracking: doc typos
* [feat] layer memory tracking: mypy fixes
* [feat] layer memory tracking: fixes for FSDP all gather tracking on pytorch 1.9 and above
* [feat] layer memory tracking: lint
* [feat] layer memory tracking: mypy
Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>

- 13 Sep, 2021 1 commit

Benjamin Lefaudeux authored

- 12 Sep, 2021 1 commit

Darryl Barnhart authored
* [fix] FSDP intra-backwards gradient accumulation. Ensure gradient reduction accumulates into the unsharded gradient tensor within a backwards pass. This matters when an FSDP module is called multiple times within a forward pass, and reduction is _not_ deferred using activation checkpoint forward counters, bucketing or some other mechanism. Closes #780
* [refactor] Remove forward counters. Comments. Removed forward counters from the activation checkpointing utility, now that FSDP does not require them for correct operation. Add a more detailed comment about memory usage behaviour with gradient reduction.
* [refactor] Delete deprecated forward counter usage.
* [refactor] Add state assertion at end of pre-backward hook.

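The scenario above is the standard autograd behaviour for a module invoked twice in one forward pass: its parameter gradients are the accumulated sum of both contributions within a single backward. The plain PyTorch sketch below shows that pattern, which is what the fix makes FSDP's sharded reduction match; it is an illustration, not fairscale code.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
x = torch.randn(2, 4)

# The same module is called twice within one forward pass.
out = layer(x) + layer(2 * x)
out.sum().backward()

# layer.weight.grad now holds the sum of the gradients from both calls, accumulated
# within a single backward pass; the fix above makes FSDP's reduce-scatter accumulate
# into the unsharded gradient tensor in the same way.
print(layer.weight.grad.shape)
```
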
- 11 Sep, 2021 1 commit

Alex Xiao authored
Before this commit, output tensors of checkpointed modules always require grad, even if they shouldn't. This commit makes it so that the outputs of checkpointed modules only require grad if either the input requires grad or if the parameters require grad. To achieve this, this commit also adds a new _unflattened_param_views attribute to modules being flattened. This allows the checkpointing to still access the parameters and check if gradients need to be computed.
Co-authored-by: Alex Xiao <axiao@fb.com>

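The behaviour described above can be checked with a small sketch using fairscale's `checkpoint_wrapper`; the import path and exact semantics are assumptions here, not taken from this log. With frozen parameters and an input that does not require grad, the checkpointed output is expected to no longer require grad after this change.

```python
# Hedged sketch of the expected behaviour after this commit.
import torch
import torch.nn as nn
from fairscale.nn.checkpoint import checkpoint_wrapper  # import path assumed

block = checkpoint_wrapper(nn.Linear(4, 4))
for p in block.parameters():
    p.requires_grad_(False)          # freeze all parameters

x = torch.randn(2, 4)                # input does not require grad
y = block(x)
print(y.requires_grad)               # expected False: neither inputs nor parameters need grad
```
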
- 10 Sep, 2021 1 commit

Benjamin Lefaudeux authored

- 07 Sep, 2021 1 commit

Achal Dixit authored
* [test] Added disable_checkpointing unit test
* [test] Added disable_checkpointing unit test (Clean-up)
* [test] Added disable_checkpointing unit test (Clean-up)
