Commits · ef194cd2345055b142407bf75c58e1e2a2d0865e · OpenDAS / fairscale

17 Nov, 2021 2 commits

[feature] Add an OffloadConfig object to specify offloading params to disk. (#855) · ef194cd2

anj-s authored Nov 17, 2021

* fixed lint issues

* remove unused print statements

* add changelog entry

* [skip ci] fix lint errors

ef194cd2

Update changelog, removed meta.yml and requirements cleanup (#853) · 2bfa5a61

Anupam Bhatnagar authored Nov 17, 2021

* update changelog

* [skip ci] removed requirements-test.txt

* [skip ci] updating changelog

* [skip ci] add PR numbers

* replacing requirements-test.txt by requirements-dev.txt

* [skip ci] changing requirements-test to requirements-dev in pre-commit and requirements-benchmarks

* [skip ci] mark manual static analysis checks as deprecated

* empty commit to trigger ci

* [skip ci] updating changelog

* [skip ci] addressing comments

* addressing more comments

2bfa5a61

15 Nov, 2021 1 commit

Allow sharded grad scaler to cpu offload with FSDP (#831) · ba5785f7

Anupam Bhatnagar authored Nov 15, 2021

* first commit

* sharded scaler hitting nan assertions

* adding test for sharded grad scaler without cpu offload

* ddp grad scaler and fsdp sharded grad scaler test failing

* removing test_output

* fix no cpu offload test

* changing optimizer from OSS to SGD

* all tests passing, code cleanup pending

* code cleanup

* fix pyproject.toml

* removing .isort.cfg

* running isort linter

* resolving isort issues

* resolving black linter issue

* resolving mypy issues

* fix import statement

* fix mypy error

* modifying import statement

* adding pytorch version requirement

* fixing pytest skip test decorator

* apply version guard for ShardedGradScaler

* removing test_fsdp_grad_scaler

* increasing num_epochs for ShardedGradScaler so that updates are not skipped

* adding support for torch 1.8

* minor edit

* [skip ci] more torch 1.8 changes

* parametrizing the tests

* cleanup code with linters

* [skip ci] update doc string

* [skip ci] addressing some more comments

ba5785f7

12 Nov, 2021 1 commit

Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d

Anupam Bhatnagar authored Nov 11, 2021

* adding pre-commit files

* applying pre-commit to all files

* adding no-strict-optional argument to mypy in circle ci config

* fix typo

* updating python versions

* [skip ci] remove extra args

* adding python 3.9

* [skip ci] set pre-commit version in requirements-dev.txt

* set CACHE_VERSION

* move linters from circleci to github actions

* update python version

* update python version in benchmarks_2

* moving to python 3.9.7

7d7edf6d

09 Nov, 2021 1 commit

CI config changes (#847) · 6f3931a4

Anupam Bhatnagar authored Nov 08, 2021

* CI config changes

* changing params for failing tests

* [skip ci] minor edit

6f3931a4

08 Nov, 2021 3 commits

[chore] 0.4.2 release (#846) · b65ce6ff

Anupam Bhatnagar authored Nov 08, 2021

* [chore] 0.4.2 release

* updating torch version

* [skip ci] updating readme and requirements.txt

b65ce6ff

[feature]Add support for SSD offload with FSDP for eval workloads (#839) · d7c4aa52

anj-s authored Nov 08, 2021

* update release notes

* initial commit

* lint cleanup etc.

* helper functions; lint errors

* lint errors

* lint errors

* add back the boolean for named_parameters

* address comments and fix lint

* remove unused functions and class

* remove unused state

d7c4aa52

[feat] Gossip/SlowMo (#378) · 21464e05

Benjamin Lefaudeux authored Nov 08, 2021



Add SlowMo Distributed Data Parallel for clusters with slow interconnects
Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>

21464e05

05 Nov, 2021 1 commit

[feat] experimental MEVO layer (#840) · 8347c1a2

Min Xu authored Nov 05, 2021



* [feat] MEVO kernel

- initial import from min/softmax and min/testing branches
- need to rename and further cleanup

* only test with newer pytorch

* renamed and added comments and code cleanup

* rename and reduce test memory

* testing

* minor fixing

* fixing

* more fix

* changelog

* more 1.7 and 1.8 paper cuts

* remove dead code

* addressed Benjamin's comments

* addressed more comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8347c1a2

03 Nov, 2021 1 commit

Update Sphinx version in docs requirements file (#841) · f327eb4a

Vinayak Tantia authored Nov 03, 2021

f327eb4a

02 Nov, 2021 2 commits

fix github URL (#838) · 5ababa88

anj-s authored Nov 02, 2021

5ababa88

update nightly torch and test the flaky test (#837) · 2aef7b84

Min Xu authored Nov 01, 2021

Co-authored-by: Min Xu <min.xu.public@gmail.com>

2aef7b84

01 Nov, 2021 2 commits

[feat] [FSDP]: add experimental support to shared weights (#836) · f2af4c66

Min Xu authored Nov 01, 2021



* added a new test, passing without shared weights

* tested weight sharing

* added the test to test list file

* extended to world_size = 2

* fixed test

* [feat]: add limited and experimental support for shared parameter

* fixed tests

* simplify to work with layer with at least 1 non-shared params and add code to pick up linked_param field for sharding the shared param

* fixed the case where linked param is not in separate FSDP

* changelog and remove old code
Co-authored-by: Min Xu <min.xu.public@gmail.com>

f2af4c66

[feature] Add the low level SSD APIs (#829) · a9fcaa28

anj-s authored Nov 01, 2021

* add doc strings

* add lower level SSD APIs and tests

* add the test to the list to be run

* remove unused imports

* more doc string changes

* fix lint errors

a9fcaa28

28 Oct, 2021 1 commit

[fix] fix test on main (#835) · 28aa2dde

Min Xu authored Oct 28, 2021



* [fix] fix test on main

* [fix] fix test on main
Co-authored-by: Min Xu <min.xu.public@gmail.com>

28aa2dde

27 Oct, 2021 6 commits

[fix] Decouple `move_params_to_cpu` from the `mixed_precision`. (#822) · ed7ca766

anj-s authored Oct 27, 2021

* remove offload dependency on fp16

* update python version for cpu tests

* run CPU tests with updated PyTorch version

* split changes

* revert tests config

* fix lint errors

* update nightly and test PyTorch versions

* skip failing multiprocess pipe test

* always skip test

* always skip test

* always skip test

* lint error

* skip unsupported versions

* improve skip message

* lint errors

* modify docs

* add tests

* fix test failures

* modify comments

* fix lint errors

* fix lint errors

ed7ca766

[test] improve a test's coverage (#798) · b60f3db0

Min Xu authored Oct 27, 2021



* checkpoint + nonflat + mixed_precision

* make tests pass with expected errors

* addressed comments

* add a comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>

b60f3db0

[feature] Skip creating the CPU grad tensor when training (#821) · 5f895f0b

anj-s authored Oct 27, 2021

* skip creating cpu grads and pinning memory

* added additional comment

* pin docutils to fix circleCI

5f895f0b

[fix]: Fixes an issue with pre_backward hook registering (#833) · 5da5c0eb

Min Xu authored Oct 27, 2021



* added the failing test

* fixed the bug

* fine-tune the condition

* typo

* typo

* changelog and added test to test files
Co-authored-by: Min Xu <min.xu.public@gmail.com>

5da5c0eb

update requirements files (#832) · cabad2f7

anj-s authored Oct 26, 2021

cabad2f7

Use correct node names for param counting in auto_shard. (#830) · 86c62cc9

Eugen Hotaj authored Oct 26, 2021

Fixes #827.
Co-authored-by: Eugen Hotaj <ehotaj@fb.com>

86c62cc9

24 Oct, 2021 1 commit

[chore] Fix main breakage temporarily by relaxing constraints (#828) · eadfdc49

anj-s authored Oct 23, 2021

* relax speed constraints

* relax the regressions constraints

eadfdc49

22 Oct, 2021 2 commits

modify golden data (#825) · 35f327f3

anj-s authored Oct 22, 2021

35f327f3

Extend auto shard capabilities to work around torch.fx edge cases. (#817) · 7bdf50a3

Eugen Hotaj authored Oct 22, 2021

auto_shard.py currently uses torch.fx to create a symbolic DAG of
operations and linearizes that DAG into an nn.Sequential so it can later
be used for model offloading. This works in most cases but runs into
issues for certain eager mode features, such as dynamic conditionals,
shape-dependent computation, etc.

This PR extends auto_shard.py to first run a preprocessing step which wraps
any nn.Module which cannot be traced through. It adds a test for dynamic
conditionals and updates existing failing test code.

There are some immediate extensions to this approach which are marked as
TODO in the code.

7bdf50a3
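
The tracing limitation described in #817 can be illustrated with a small sketch. This is not the code from auto_shard.py: it swaps in a simpler technique, treating untraceable modules as opaque leaves through `torch.fx.Tracer.is_leaf_module`, instead of the wrapping preprocessing step the PR adds. `DynamicBlock`, `Model`, `LeafWrappingTracer` and `trace_with_leaves` are hypothetical names used only for illustration.

```python
import torch
import torch.fx as fx
from torch import nn


class DynamicBlock(nn.Module):
    """Hypothetical module with a data-dependent branch (a dynamic conditional)."""

    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Branching on a tensor value cannot be followed by symbolic tracing.
        if x.sum() > 0:
            return self.linear(x)
        return -self.linear(x)


class Model(nn.Module):
    """Hypothetical model: one traceable layer followed by the dynamic block."""

    def __init__(self) -> None:
        super().__init__()
        self.pre = nn.Linear(4, 4)
        self.block = DynamicBlock()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(self.pre(x))


class LeafWrappingTracer(fx.Tracer):
    """Tracer that does not trace into the given module types (assumed approach)."""

    def __init__(self, untraceable: tuple) -> None:
        super().__init__()
        self.untraceable = untraceable

    def is_leaf_module(self, m: nn.Module, module_qualified_name: str) -> bool:
        # Leaf modules are recorded as a single call_module node in the graph.
        if isinstance(m, self.untraceable):
            return True
        return super().is_leaf_module(m, module_qualified_name)


def trace_with_leaves(model: nn.Module, untraceable: tuple) -> fx.GraphModule:
    """Trace `model`, keeping modules of the `untraceable` types opaque."""
    graph = LeafWrappingTracer(untraceable).trace(model)
    return fx.GraphModule(model, graph)
```

Calling `fx.symbolic_trace(Model())` directly fails on the conditional, while `trace_with_leaves(Model(), (DynamicBlock,))` yields a graph in which `block` appears as a single opaque node.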

21 Oct, 2021 2 commits

[chore] Update the PyTorch version that we run benchmarks with. (#823) · e4da75ea

anj-s authored Oct 21, 2021

* update pytorch version for benchmarks

* reduce golden data precision check

e4da75ea

[chore] Update the PyTorch version that we run CPU tests with (#809) · 11a24161

anj-s authored Oct 20, 2021

* update python version for cpu tests

* run CPU tests with updated PyTorch version

* update nightly and test PyTorch versions

* skip failing multiprocess pipe test

* always skip test

* always skip test

* always skip test

* lint error

* skip unsupported versions

* improve skip message

* lint errors

11a24161

20 Oct, 2021 3 commits

[chore] Add log for the new experimental memory tracker feature. (#819) · ce2ad89e

anj-s authored Oct 20, 2021

* add log for new memory tracker features

* add log for new memory tracker features

ce2ad89e

[feat] layer memory tracking (#808) · ad92220c

Quentin Duval authored Oct 20, 2021



* [feat] layer memory tracking

* [feat] layer memory tracking (add tests in CI)

* [feat] layer memory tracking: doc typos

* [feat] layer memory tracking: mypy fixes

* [feat] layer memory tracking: fixes for FSDP all gather tracking on pytorch 1.9 and above

* [feat] layer memory tracking: lint

* [feat] layer memory tracking: mypy
Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>

ad92220c

remove deprecated func (#818) · 51e43b61

anj-s authored Oct 19, 2021

51e43b61

19 Oct, 2021 1 commit

[FairScale] Remove refs to "cpu_offload" in code comments (#814) · fb7b6a93

Rohan Varma authored Oct 19, 2021

* fix

* remove dup file

fb7b6a93

28 Sep, 2021 1 commit

revert accidental commit · 8acbec71

Anjali Sridhar authored Sep 27, 2021

8acbec71

24 Sep, 2021 1 commit

simplify condition for readability · 180c9197

Anjali Sridhar authored Sep 24, 2021

180c9197

22 Sep, 2021 1 commit

Switch default branch from master to main (#807) · b09ddb2d

tmarkstrum authored Sep 22, 2021

* update master branch to main

* added FAQ about updating the branch from master to main

* fixed some false positive correction

* added what is new section

* fixed the quoted code area

* added release what is new section

* added a step in release.md

* fixed a word

b09ddb2d

21 Sep, 2021 1 commit

Update offload_model.rst (#806) · fecb665b

anj-s authored Sep 21, 2021

fecb665b

20 Sep, 2021 1 commit

[chore]0.4.1 release (#803) · 1b9be421

tmarkstrum authored Sep 20, 2021

* [chore]0.4.1 release

* put more details in one change log

1b9be421

17 Sep, 2021 1 commit

add toggle to disable using the nccl base collectives (#799) · 086402d5

tmarkstrum authored Sep 17, 2021

* add toggle to disable using the nccl base collectives

* added todo to remove the toggle when the issue is resolved.

086402d5

13 Sep, 2021 1 commit

[OSS] Fixing the fp16 broadcast and catching this case in the unit test (#795) · 180ab8c8

Benjamin Lefaudeux authored Sep 13, 2021

180ab8c8

12 Sep, 2021 2 commits

[fix] minor fixes for master branch (#792) · 31e36453

Min Xu authored Sep 12, 2021



* add changelog for previous commit

* add changelog for previous commit

* add changelog for previous commit

* fix a merge induced error
Co-authored-by: Min Xu <min.xu.public@gmail.com>

31e36453

[fix] FSDP intra-backwards gradient accumulation. (#784) · 4fa2ab9b

Darryl Barnhart authored Sep 12, 2021

* [fix] FSDP intra-backwards gradient accumulation.

Ensure gradient reduction accumulates into the unsharded gradient tensor
within a backwards pass. This matters when an FSDP module is called
multiple times within a forward pass, and reduction is _not_ deferred
using activation checkpoint forward counters, bucketing or some other
mechanism.

Closes #780

* [refactor] Remove forward counters. Comments.

Removed forward counters from the activation checkpointing utility, now
that FSDP does not require them for correct operation. Add more detailed
comment about memory usage behaviour with gradient reduction.

* [refactor] Delete deprecated forward counter usage.

* [refactor] Add state assertion at end of pre-backward hook.

4fa2ab9b
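
The accumulation issue fixed in #784 can be shown with a toy sketch in plain Python. It is not FSDP's implementation: `GradBuffer` and `run_backward` are hypothetical names, and the lists of floats stand in for gradient tensors. The assumption is that a module called twice in one forward pass delivers two partial gradients during a single backward pass, so the second one must be added to the first rather than replace it.

```python
from typing import List, Optional


class GradBuffer:
    """Toy stand-in for a parameter's gradient storage (hypothetical)."""

    def __init__(self) -> None:
        self.grad: Optional[List[float]] = None

    def begin_backward(self) -> None:
        # Start of a backward pass: no gradient has been produced yet.
        self.grad = None

    def overwrite(self, partial: List[float]) -> None:
        # Buggy behaviour: a later partial gradient replaces the earlier one.
        self.grad = list(partial)

    def accumulate(self, partial: List[float]) -> None:
        # Fixed behaviour: partial gradients within one backward pass are summed.
        if self.grad is None:
            self.grad = list(partial)
        else:
            self.grad = [g + p for g, p in zip(self.grad, partial)]


def run_backward(
    buffer: GradBuffer, partials: List[List[float]], accumulate: bool
) -> List[float]:
    """Deliver each partial gradient to the buffer, once per module call."""
    buffer.begin_backward()
    for partial in partials:
        if accumulate:
            buffer.accumulate(partial)
        else:
            buffer.overwrite(partial)
    assert buffer.grad is not None
    return buffer.grad
```

With two partial gradients `[1.0, 2.0]` and `[0.5, 0.5]`, the accumulating path returns their sum, while the overwriting path silently keeps only the last one.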

11 Sep, 2021 1 commit

[feat] set requires_grad of output tensors of checkpointed modules properly (#787) · 482944d9

Alex Xiao authored Sep 10, 2021



Before this commit, output tensors of checkpointed modules always
require grad, even if they shouldn't. This commit makes it so that
the outputs of checkpointed modules only require grad if either
the input requires grad or if the parameters require grad.

To achieve this, this commit also adds a new _unflattened_param_views
attribute to modules being flattened. This allows the checkpointing
to still access the parameters and check if gradients need to be
computed.
Co-authored-by: Alex Xiao <axiao@fb.com>

482944d9
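
The rule stated in #787 can be written as a short predicate. This is a minimal sketch of the rule only, not the checkpointing code from the commit; `output_should_require_grad` is a hypothetical helper name, and it reads parameters through `module.parameters()` rather than the `_unflattened_param_views` attribute the commit introduces.

```python
from typing import Iterable

import torch
from torch import nn


def output_should_require_grad(module: nn.Module, inputs: Iterable[object]) -> bool:
    """Return True if any tensor input or any parameter requires grad."""
    # Non-tensor inputs (ints, None, etc.) never contribute to autograd.
    any_input = any(isinstance(t, torch.Tensor) and t.requires_grad for t in inputs)
    any_param = any(p.requires_grad for p in module.parameters())
    return any_input or any_param
```

For a module whose parameters are all frozen and whose inputs do not require grad, the predicate returns False, which matches the case where the output of a checkpointed module should not require grad.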