- 21 Oct, 2021 1 commit
-
-
anj-s authored
* update python version for cpu tests
* run CPU tests with updated PyTorch version
* update nightly and test PyTorch versions
* skip failing multiprocess pipe test
* always skip test
* lint error
* skip unsupported versions
* improve skip message
* lint errors
-
- 20 Oct, 2021 3 commits
-
-
anj-s authored
* add log for new memory tracker features
-
Quentin Duval authored
* [feat] layer memory tracking
* [feat] layer memory tracking (add tests in CI)
* [feat] layer memory tracking: doc typos
* [feat] layer memory tracking: mypy fixes
* [feat] layer memory tracking: fixes for FSDP all gather tracking on pytorch 1.9 and above
* [feat] layer memory tracking: lint
* [feat] layer memory tracking: mypy

Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
-
anj-s authored
-
- 19 Oct, 2021 1 commit
-
-
Rohan Varma authored
* fix
* remove dup file
-
- 28 Sep, 2021 1 commit
-
-
Anjali Sridhar authored
-
- 24 Sep, 2021 1 commit
-
-
Anjali Sridhar authored
-
- 22 Sep, 2021 1 commit
-
-
tmarkstrum authored
* update master branch to main
* added FAQ about updating the branch from master to main
* fixed some false-positive corrections
* added "what is new" section
* fixed the quoted code area
* added release "what is new" section
* added a step in release.md
* fixed a word
-
- 21 Sep, 2021 1 commit
-
-
anj-s authored
-
- 20 Sep, 2021 1 commit
-
-
tmarkstrum authored
* [chore] 0.4.1 release
* put more details in one changelog entry
-
- 17 Sep, 2021 1 commit
-
-
tmarkstrum authored
* add a toggle to disable using the NCCL base collectives
* added a TODO to remove the toggle when the issue is resolved
-
- 13 Sep, 2021 1 commit
-
-
Benjamin Lefaudeux authored
-
- 12 Sep, 2021 2 commits
-
-
Min Xu authored
* add changelog for previous commit
* fix a merge-induced error

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
Darryl Barnhart authored
* [fix] FSDP intra-backwards gradient accumulation. Ensure gradient reduction accumulates into the unsharded gradient tensor within a backwards pass. This matters when an FSDP module is called multiple times within a forward pass and reduction is _not_ deferred using activation checkpoint forward counters, bucketing, or some other mechanism. Closes #780
* [refactor] Remove forward counters. Comments. Removed forward counters from the activation checkpointing utility, now that FSDP does not require them for correct operation. Add a more detailed comment about memory usage behaviour with gradient reduction.
* [refactor] Delete deprecated forward counter usage.
* [refactor] Add state assertion at the end of the pre-backward hook.
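To illustrate the reuse pattern this fix covers, here is a minimal plain-PyTorch sketch (no FSDP wrapping; module and tensor names are illustrative): a submodule is invoked twice in one forward pass, so a single backward pass must accumulate gradients from both uses into one `.grad` tensor. The fix ensures FSDP's reduced, unsharded gradients accumulate the same way.

```python
import torch
import torch.nn as nn


class Reuse(nn.Module):
    """Toy model that applies the same linear block twice per forward pass."""

    def __init__(self) -> None:
        super().__init__()
        self.block = nn.Linear(4, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(self.block(x))  # two uses -> two gradient contributions


model = Reuse()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# Both uses of `block` contribute to the same .grad tensor within this single
# backward pass; under FSDP, the per-use reduced gradients must accumulate into
# the unsharded gradient tensor in the same way.
print(model.block.weight.grad.shape)  # torch.Size([4, 4])
```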
-
- 11 Sep, 2021 1 commit
-
-
Alex Xiao authored
Before this commit, output tensors of checkpointed modules always require grad, even if they shouldn't. This commit makes it so that the outputs of checkpointed modules only require grad if either the input requires grad or if the parameters require grad. To achieve this, this commit also adds a new _unflattened_param_views attribute to modules being flattened. This allows the checkpointing to still access the parameters and check if gradients need to be computed.

Co-authored-by: Alex Xiao <axiao@fb.com>
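A small sketch of the behaviour described above, assuming fairscale's `checkpoint_wrapper` is importable from `fairscale.nn` (the wrapped module is illustrative; before this change the second print would also report True):

```python
import torch
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

module = checkpoint_wrapper(nn.Linear(4, 4))
x = torch.randn(2, 4)  # plain input, requires_grad=False

# Training-style call: the parameters require grad, so the output should too.
print(module(x).requires_grad)  # True

# Frozen module and no-grad input: with this change the output should no longer require grad.
for p in module.parameters():
    p.requires_grad = False
print(module(x).requires_grad)  # expected False with this fix
```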
-
- 10 Sep, 2021 2 commits
-
-
Min Xu authored
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
Benjamin Lefaudeux authored
-
- 07 Sep, 2021 1 commit
-
-
Achal Dixit authored
* [test] Added disable_checkpointing unit test
* [test] Added disable_checkpointing unit test (Clean-up)
* [test] Added disable_checkpointing unit test (Clean-up)
-
- 06 Sep, 2021 2 commits
-
-
Min Xu authored
[cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744)

* changelog; mypy; oss cleanup
* more broadcast_object cleanup in FSDP
* one more mypy fix
* retire pytorch 1.6 from circleci, add new nightly, add 1.8 LTS and 1.9 stable release
* update torch version for LTS
* minor fixes
* update cache key
* trying newer gpu VMs
* bump the cache
* update to gpu.medium, which should be 2 GPUs
* update nightly version
* add pre-commit instruction
* fixed CHANGELOG after merging
* updated to newer nightly
* retained the older broadcast function for older GPUs for oss.py
* fixed a bug
* added a comment
* fixing a test for pytorch 1.10
* testing a fix
* Update fairscale/optim/oss.py
* Update CONTRIBUTING.md

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 05 Sep, 2021 1 commit
-
-
Min Xu authored
* [bug] [FSDP] making sure we use full params for multiple backwards within an iteration
* changelog

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 18 Aug, 2021 1 commit
-
-
Vittorio Caggiano authored
-
- 12 Aug, 2021 4 commits
-
-
anj-s authored
-
Min Xu authored
* minor: changelog and pre-commit
* addressed comment
* update the release doc

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
anj-s authored
* add additional assert for checking if the requires_grad field is set
* fix lint errors
* add unit tests and address comments
-
anj-s authored
[FSDP][feature] Support returning the original parameter names after a model has been wrapped with FSDP (#755)

* checkpoint work
* fix lint issues
* remove debug statement
* remove print
* fix lint errors
* add comments and fix lint errors
* modified comments and tests
-
- 10 Aug, 2021 1 commit
-
-
Rahul Iyer authored
Pre-commit hooks fail when run on all files for three reasons (see trace below):

1. Trailing whitespace on multiple files
2. mypy fails to load numpy and then subsequently fails to load LazyModule from pipe.py
3. isort sees issues with known_third_party packages

```
> pre-commit run --all-files
Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook
Fixing docs/source/conf.py
Fixing fairscale/experimental/nn/auto_shard.py
Fixing docs/source/deep_dive/activation_checkpointing.rst
Fixing docs/source/tutorials/pipe.rst
Fixing docs/source/installation_instructions.rst
Fixing docs/source/deep_dive/pipeline_parallelism.rst
Fixing docs/source/tutorials/activation_checkpointing.rst
Fixing docs/source/tutorials/offload_model.rst
Fixing docs/source/deep_dive/oss_sdp_fsdp.rst
Fixing docs/source/what_is_fairscale.rst
Fixing CHANGELOG.md
Fixing fairscale/experimental/nn/offload.py
Fixing docs/source/index.rst
Fixing docs/source/deep_dive/adascale.rst
Fixing README.md
Fixing docs/source/tutorials/oss.rst
Fixing docs/source/deep_dive/offload.rst
Check python ast.........................................................Passed
Check for merge conflicts................................................Passed
Don't commit to branch...................................................Passed
Check for added large files..............................................Passed
Fix End of Files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook
Fixing requirements.txt
Fixing docs/source/getting_started.rst
Fixing docs/source/installation_instructions.rst
Fixing codecov.yml
Fixing docs/source/deep_dive/adascale.rst
Fixing docs/source/tutorials/oss.rst
Fixing docs/source/deep_dive/offload.rst
black....................................................................Passed
flake8...................................................................Passed
seed isort known_third_party.............................................Failed
- hook id: seed-isort-config
- exit code: 1
- files were modified by this hook
isort....................................................................Passed
mypy.....................................................................Failed
- hook id: mypy
- exit code: 2
setup.cfg:45: error: Error importing plugin 'numpy.typing.mypy_plugin': No module named 'numpy'
Found 1 error in 1 file (checked 197 source files)
```
-
- 02 Aug, 2021 2 commits
-
-
mrshenli authored
`wrap` from `auto_wrap` is used in the docstring example but is missing from the imports.
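For context, a minimal sketch of what the fixed import line looks like, assuming the wrapping helpers are re-exported from `fairscale.nn` as in recent releases:

```python
# `wrap` was referenced in the docstring example but missing from this import line;
# these helpers are typically used together inside an `enable_wrap(...)` context.
from fairscale.nn import auto_wrap, enable_wrap, wrap
```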
-
Howard Huang authored
-
- 01 Aug, 2021 1 commit
-
-
Min Xu authored
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 31 Jul, 2021 1 commit
-
-
Myle Ott authored
* Add test (broken) for gradient accumulation without no_sync context manager
* changelog
* no_sync to grad_acc renaming for tests
* clean up tmp files
* support grad acc without no_sync
* minor
* update changelog
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py (better assertion from Sam)
* lint

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
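A rough sketch of the two accumulation patterns this change distinguishes (illustrative only; `model` is assumed to be an FSDP- or DDP-wrapped module exposing `no_sync()`, and `batches` an iterable of input tensors):

```python
import torch


def accumulate_with_no_sync(model: torch.nn.Module, batches) -> None:
    """Skip gradient reduction on all but the last micro-batch."""
    *rest, last = list(batches)
    for x in rest:
        with model.no_sync():         # defer cross-rank reduction for these micro-batches
            model(x).sum().backward()
    model(last).sum().backward()       # reduction fires here, once per optimizer step


def accumulate_without_no_sync(model: torch.nn.Module, batches) -> None:
    """What this change enables for FSDP: plain repeated backward calls.

    Reduced gradients accumulate across micro-batches, at the cost of a
    reduction per micro-batch instead of one per optimizer step.
    """
    for x in batches:
        model(x).sum().backward()
```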
-
- 30 Jul, 2021 1 commit
-
-
Yanli Zhao authored
Move final backward callback to pre-backward hook of root FSDP instance

Summary: Move the final backward callback to the pre-backward hook of the root FSDP instance, so that it is always attached to the outermost backward call and fired after all backward calls are completed. Also added flags to check that the final backward callback is fired whenever it is required. If the root FSDP is checkpointed and called multiple times in forward, the checkpoint counter is used to make sure the final backward callback is queued inside the last inner backward call as well.

Test Plan: unit tests

* reformat
* nits and unit tests
* address some comments
* replace m with self
* reformat
* nits
* remove the fired flag
* assert state on root only
* comments
* comments
-
- 27 Jul, 2021 2 commits
-
-
Min Xu authored
* [chore] 0.3.9 release
* update changelog
* address comments

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
Benjamin Lefaudeux authored
-
- 26 Jul, 2021 1 commit
-
-
Min Xu authored
* [feat] FSDP: supporting multiple flatten parameter groups - step 3: make FSDP use FlattenParamModule unconditionally
* fixing the auto_wrap tests
* minor
* rewrite local_metadata_dict - updated FPW so that custom flat param name is also supported
* bug fix
* mypy
* rewrote consolidate_shard_weights - test_consolidate passes
* comments
* fixing pickling
* Fix shared params and MoE logic (#749)
* add strict kwarg to support fairseq:gshard MoE saving logic
* Test fairseq style shard
* style
* formatting and address comments
* added changelog
* fixing a test after padding renaming

Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
-
- 19 Jul, 2021 1 commit
-
-
liangluofb authored
* Update fully_sharded_data_parallel.py to use _allgather_base
* Update reduce_scatter_bucketer.py to use reduce_scatter_base
* Update fully_sharded_data_parallel.py: nonblocking gradient CPU copy, and nonblocking param rebuilds
* Update reduce_scatter_bucketer.py: lints
* lints, linter, test fix
* linter fixes (reduce_scatter_bucketer.py, test_fsdp_overlap.py)
* Update test_fsdp_overlap.py
* Update fairscale/utils/reduce_scatter_bucketer.py
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
* isort

Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-185.ec2.internal>
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-77-164.ec2.internal>
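A rough sketch of the difference between the list-based collective and the flat "base" variant this change adopts (a single-process "gloo" group is used only so the calls can run locally; the base variant is probed defensively since not every backend or version implements it):

```python
import os
import torch
import torch.distributed as dist

# Single-process group so the collectives can run locally for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

world_size = dist.get_world_size()
shard = torch.arange(4.0)  # this rank's shard

# Classic all_gather: one tensor per rank, collected into a Python list.
gathered = [torch.empty_like(shard) for _ in range(world_size)]
dist.all_gather(gathered, shard)

# _all_gather_base: gathers directly into one flat output buffer, avoiding the
# per-rank list (and a copy when the consumer wants the flat layout anyway).
if hasattr(dist, "_all_gather_base"):
    flat = torch.empty(world_size * shard.numel())
    try:
        dist._all_gather_base(flat, shard)
    except RuntimeError:
        pass  # some backends/versions do not implement the base variant

dist.destroy_process_group()
```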
-
- 12 Jul, 2021 2 commits
-
-
anj-s authored
-
Vittorio Caggiano authored
Fix a misspelled name
-
- 07 Jul, 2021 1 commit
-
-
Edward Z. Yang authored
See https://github.com/pytorch/pytorch/pull/59671/

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
-
- 28 Jun, 2021 1 commit
-
-
anj-s authored
-