1. 28 Jan, 2022 1 commit
  2. 14 Jan, 2022 1 commit
  3. 13 Jan, 2022 2 commits
    • [feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2
      Anupam Bhatnagar authored
      * [skip ci] first commit
      
      * [skip ci] gradient scaler example
      
      * [skip ci] adding feed forward toy example
      
      * [skip ci] adding types
      
      * [skip ci] adding backward hook
      
      * [skip ci] update
      
      * [skip ci] working feed forward example
      
      * [skip ci] working feed forward example
      
      * [skip ci] use named_modules instead of named_children
      
      * [skip ci] adding new file
      
      * [skip ci] clean up
      
      * [skip ci] implement unscale function
      
      * [skip ci] implement unscale function
      
      * [skip ci] removing old file
      
      * [skip ci] removing some more old files
      
      * [skip ci] making unscale function generic
      
      * [skip ci] adding test for vision model
      
      * [skip ci] adding identity layer
      
      * [skip ci] cleanup files
      
      * [skip ci] refactoring
      
      * [skip ci] more refactoring
      
      * [skip ci] added functionality to update scale
      
      * [skip ci] data loader clean up
      
      * [skip ci] implemented inf checks and update scale functions
      
      * [skip ci] code clean up; added test with autocast (does not work atm)
      
      * adding documentation
      
      * adding dependency in requirements-dev.txt
      
      * updating pytorch nightly version
      
      * updating changelog
      
      * adding is_cuda_available to test_vision_model
      
      * set same timeout on cpu and gpu
      
      * reverting cpu timeout, skip vision test on cpu
      
      * addressing comments, fixing vision test
      
      * unscale uses in-place matmul
      
      * some more cleanup
      52d066a2
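      A minimal sketch of the layerwise-scaling idea behind this feature, in plain PyTorch autograd (the classes and names below are illustrative, not the fairscale API): each layer gets its own scale factor that amplifies gradients inside the layer to avoid fp16 underflow, and that layer's parameter gradients are unscaled before the optimizer step.

        import torch

        class _ScaleGrad(torch.autograd.Function):
            # Identity in forward; multiplies the gradient by a constant in backward.
            @staticmethod
            def forward(ctx, x, scale):
                ctx.scale = scale
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return grad_output * ctx.scale, None

        class ScaledLayer(torch.nn.Module):
            # Wraps a layer so gradients *inside* it are computed at `scale`.
            def __init__(self, layer, scale=2.0 ** 10):
                super().__init__()
                self.layer, self.scale = layer, scale

            def forward(self, x):
                x = _ScaleGrad.apply(x, 1.0 / self.scale)  # backward: undo scale for upstream layers
                x = self.layer(x)
                return _ScaleGrad.apply(x, self.scale)     # backward: amplify grads entering this layer

            def unscale_(self):
                # Call before optimizer.step(): param grads were computed at `scale`.
                for p in self.layer.parameters():
                    if p.grad is not None:
                        p.grad.div_(self.scale)

      Because the two identity hooks bracket the layer, the scaling stays local: upstream layers see unmodified gradients, and only the wrapped layer's parameter gradients need the final unscale.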
    • [Fix][FSDP] fixed padding size of input tensor for reduce scatter (#907) · fb4eca19
      tmarkstrum authored
      
      
      * fixed padding size of input tensor for reduce scatter, and fixed an error that assigned the wrong group
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      
      * added changelog
      
      * fixed some comments.
      
      * added a unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.
      
      * throw an error instead of rolling back to use default process group for reduce_scatter_process_group
      
      * Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"
      
      This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.
      
      * added check for None to avoid unit test failure
      
      * fixed an error to avoid unit test failures
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      fb4eca19
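      For context on the padding fix above: torch.distributed.reduce_scatter needs the flattened input to split evenly across the ranks of the group it runs on. A hedged sketch of that requirement (assumes an initialized process group; not the FSDP code itself):

        import torch
        import torch.distributed as dist
        import torch.nn.functional as F

        def padded_reduce_scatter(flat: torch.Tensor, group=None) -> torch.Tensor:
            # Pad to a multiple of *this group's* world size; sizing the pad
            # against the wrong group is exactly the bug class fixed here.
            world_size = dist.get_world_size(group)
            pad = -flat.numel() % world_size
            if pad:
                flat = F.pad(flat, [0, pad])
            out = flat.new_empty(flat.numel() // world_size)
            dist.reduce_scatter(out, list(flat.chunk(world_size)), group=group)
            return out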
  4. 12 Jan, 2022 1 commit
  5. 06 Jan, 2022 1 commit
    • FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb
      four4fish authored
      * FullyShardedDataParallel: only return full state dict on rank 0
      
      * Add flag and make rank 0 only optional
      
      * Add tests
      
      * Add docs
      
      * address comments
      
      * update comments
      
      * update torch nightly version
      
      * update torchvision number for torch nightly dependency
      
      * add changelog
      
      * Update CHANGELOG.md
      
      * Update CHANGELOG.md
      d3417ceb
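      A hedged usage sketch of this change (assumes an initialized process group and some module `net`; the flag name follows this PR's description, so verify it against the FSDP docs): with the flag set, state_dict() materializes the full unsharded weights on rank 0 only, sparing every other rank that memory cost during checkpointing.

        import torch
        import torch.distributed as dist
        from fairscale.nn import FullyShardedDataParallel as FSDP

        model = FSDP(net, state_dict_on_rank_0_only=True)
        state = model.state_dict()   # full weights on rank 0, empty dict elsewhere
        if dist.get_rank() == 0:
            torch.save(state, "checkpoint.pt")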
  6. 05 Jan, 2022 1 commit
    • Enabling ssd_offload training basic tests. (#887) · c5e471bc
      Paul Johnson authored
      * Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
      * Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
      * Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
      * Modifications from PR #887 review comments.
      * Update Changelog
      c5e471bc
  7. 21 Dec, 2021 3 commits
  8. 02 Dec, 2021 1 commit
    • [fix] [FSDP] Do not lose original reshard_after_forward (#880) · 7c2c3e00
      Min Xu authored
      * [fix] [FSDP] Do not lose original reshard_after_forward
      
      - In a corner case we can lose this value
      - Saving it and using it in the reset function fixes the issue
      - A trivial case, probably not worth a dedicated test for now
      
      * added changelog
      7c2c3e00
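      For context, the flag this fix preserves (an illustrative snippet; `block` is a placeholder module): reshard_after_forward=True frees the full parameters right after each forward pass, trading an extra all-gather in backward for lower peak memory. The bug was that a reset could silently drop the value originally passed here.

        from fairscale.nn import FullyShardedDataParallel as FSDP

        wrapped = FSDP(block, reshard_after_forward=True)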
  9. 18 Nov, 2021 2 commits
  10. 17 Nov, 2021 2 commits
  11. 12 Nov, 2021 1 commit
    • Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d
      Anupam Bhatnagar authored
      * adding pre-commit files
      
      * applying pre-commit to all files
      
      * adding no-strict-optional argument to mypy in circle ci config
      
      * fix typo
      
      * updating python versions
      
      * [skip ci] remove extra args
      
      * adding python 3.9
      
      * [skip ci] set pre-commit version in requirements-dev.txt
      
      * set CACHE_VERSION
      
      * move linters from circleci to github actions
      
      * update python version
      
      * update python version in benchmarks_2
      
      * moving to python 3.9.7
      7d7edf6d
  12. 08 Nov, 2021 3 commits
  13. 05 Nov, 2021 1 commit
    • [feat] experimental MEVO layer (#840) · 8347c1a2
      Min Xu authored
      
      
      * [feat] MEVO kernel
      
      - initial import from min/softmax and min/testing branches
      - needs renaming and further cleanup
      
      * only test with newer pytorch
      
      * renamed and added comments and code cleanup
      
      * rename and reduce test memory
      
      * testing
      
      * minor fixing
      
      * fixing
      
      * more fix
      
      * changelog
      
      * more 1.7 and 1.8 paper cuts
      
      * remove dead code
      
      * addressed Benjamin's comments
      
      * addressed more comments
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      8347c1a2
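      MEVO targets the memory spike of the final vocab projection plus softmax in large-vocabulary models. A simplified sketch of the general idea it builds on, not the MEVO kernel itself (names below are illustrative): process tokens in chunks under activation checkpointing so the full [tokens, vocab] logits tensor never exists at once.

        import torch
        import torch.nn.functional as F
        from torch.utils.checkpoint import checkpoint

        def _chunk_loss(h, w, t):
            return F.cross_entropy(h @ w.t(), t, reduction="sum")

        def chunked_vocab_loss(hidden, weight, targets, chunk=1024):
            # Checkpointing each chunk drops its [chunk, vocab] logits after use
            # and recomputes them in backward, capping peak activation memory.
            total = hidden.new_zeros(())
            for h, t in zip(hidden.split(chunk), targets.split(chunk)):
                total = total + checkpoint(_chunk_loss, h, weight, t)
            return total / targets.numel()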
  14. 01 Nov, 2021 1 commit
    • [feat] [FSDP]: add experimental support for shared weights (#836) · f2af4c66
      Min Xu authored
      
      
      * added a new test, passing without shared weights
      
      * tested weight sharing
      
      * added the test to test list file
      
      * extended to world_size = 2
      
      * fixed test
      
      * [feat]: add limited and experimental support for shared parameter
      
      * fixed tests
      
      * simplified to work with layers that have at least one non-shared param, and added code to pick up the linked_param field for sharding the shared param
      
      * fixed the case where linked param is not in separate FSDP
      
      * changelog and remove old code
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      f2af4c66
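      The kind of weight sharing this experimental support targets, in toy form (an illustrative module, not one from the test suite): a tied input embedding and output projection, i.e. one parameter tensor reachable from two modules, which FSDP must flatten and shard exactly once.

        import torch.nn as nn

        class TiedModel(nn.Module):
            def __init__(self, vocab: int = 100, d_model: int = 16):
                super().__init__()
                self.embed = nn.Embedding(vocab, d_model)
                self.proj = nn.Linear(d_model, vocab, bias=False)
                self.proj.weight = self.embed.weight  # shared parameter

            def forward(self, tokens):
                return self.proj(self.embed(tokens))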
  15. 27 Oct, 2021 1 commit
  16. 20 Oct, 2021 1 commit
  17. 20 Sep, 2021 1 commit
  18. 13 Sep, 2021 1 commit
  19. 12 Sep, 2021 1 commit
  20. 05 Sep, 2021 1 commit
  21. 12 Aug, 2021 2 commits
  22. 01 Aug, 2021 1 commit
  23. 31 Jul, 2021 1 commit
  24. 27 Jul, 2021 2 commits
  25. 26 Jul, 2021 1 commit
    • [feat]: prepare FSDP to handle multiple flatten params and fix metadata saving for MoE (#746) · 83b0b49e
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 3: make FSDP use FlattenParamModule unconditionally
      
      * fixing the auto_wrap tests
      
      * minor
      
      * rewrite local_metadata_dict
      
      - updated FPW so that custom flat param name is also supported
      
      * bug fix
      
      * mypy
      
      * rewrote consolidate_shard_weights
      
      - test_consolidate passes
      
      * comments
      
      * fixing pickling
      
      * Fix shared params and MoE logic (#749)
      
      * add strict kwarg to support fairseq:gshard MoE saving logic
      
      * Test fairseq style shard
      
      * style
      
      * formatting and address comments
      
      * added changelog
      
      * fixing a test after padding renaming
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      83b0b49e
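      A conceptual sketch of what "flatten parameters" means in these PRs (plain tensors, not the FlattenParamsWrapper API): individual parameters become views into one contiguous buffer, which is the unit FSDP shards and all-gathers.

        import torch

        params = [torch.randn(3, 4), torch.randn(5)]       # stand-ins for module params
        numels = [p.numel() for p in params]
        flat = torch.cat([p.reshape(-1) for p in params])  # one contiguous buffer
        views = [v.view(p.shape) for v, p in zip(flat.split(numels), params)]
        assert all(torch.equal(v, p) for v, p in zip(views, params))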
  26. 12 Jul, 2021 1 commit
  27. 21 Jun, 2021 1 commit
    • [feat] FSDP: supporting multiple flatten parameter groups (#711) · ab71efb3
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 2: extending FPW to support multiple flat param groups
      - FSDP still only uses one group
      - unit tests exercise the new code paths
      - updated the changelog
      
      * first cut, mypy passed
      
      * test_flatten_params_wrapper.py::TestFlattenParams tests pass
      
      * added two more test cases and fixed a case in the code
      
      * fixed one bug with param_path_infos
      
      * fixed two more tests with hardcoded flat_param names
      
      * Update CHANGELOG.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      ab71efb3
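      Step 2's "multiple flat param groups" in miniature (a hypothetical bucketing helper, here keyed by dtype; FPW's real grouping and naming logic differs): each bucket gets its own flat buffer plus the metadata needed to restore per-parameter views later.

        import torch
        from collections import defaultdict

        def flatten_by_group(named_params):
            buckets = defaultdict(list)
            for name, p in named_params:
                buckets[p.dtype].append((name, p))  # one group per dtype
            return {
                key: (torch.cat([p.detach().reshape(-1) for _, p in items]),
                      [(name, p.shape) for name, p in items])
                for key, items in buckets.items()
            }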
  28. 11 Jun, 2021 1 commit
    • Use original forward pass directly when in eval mode from within checkpoint wrapper (#709) · 370b8483
      Pete authored
      * add failing test
      
      * add fix
      
      * use 'torch.is_grad_enabled()' instead of 'module.training'
      
      * Revert "add failing test"
      
      This reverts commit 1c34242208f9b2c5fa6c8f181434c2be6d7cdbc0.
      
      * add simple test
      
      * improve test
      
      * add check for fwd_counter
      
      * revert typing/format changes
      
      * move to new test file
      
      * CHANGELOG
      
      * remove old test
      
      * fix import order
      
      * fix test to be compat with torch 1.6.0
      
      * clean up
      
      * comments
      
      * isort 🤦
      370b8483
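      The dispatch this PR implements, reduced to its essence (a simplified sketch; the real wrapper in fairscale.nn.checkpoint tracks more state, such as the fwd_counter mentioned above): activation checkpointing only pays off when a backward pass will follow, so with gradients disabled the original forward runs directly.

        import torch
        from torch.utils.checkpoint import checkpoint

        def maybe_checkpointed_forward(module: torch.nn.Module, *args):
            if not torch.is_grad_enabled():   # eval / torch.no_grad(): no backward coming
                return module(*args)          # use the original forward directly
            return checkpoint(module, *args)  # recompute activations in backward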
  29. 01 Jun, 2021 1 commit
  30. 28 May, 2021 1 commit
  31. 18 May, 2021 1 commit