1. 18 Nov, 2021 1 commit
  2. 17 Nov, 2021 2 commits
  3. 12 Nov, 2021 1 commit
    • Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d
      Anupam Bhatnagar authored
      * adding pre-commit files
      
      * applying pre-commit to all files
      
      * adding no-strict-optional argument to mypy in circle ci config
      
      * fix typo
      
      * updating python versions
      
      * [skip ci] remove extra args
      
      * adding python 3.9
      
      * [skip ci] set pre-commit version in requirements-dev.txt
      
      * set CACHE_VERSION
      
      * move linters from circleci to github actions
      
      * update python version
      
      * update python version in benchmarks_2
      
      * moving to python 3.9.7
  4. 08 Nov, 2021 3 commits
  5. 05 Nov, 2021 1 commit
    • [feat] experimental MEVO layer (#840) · 8347c1a2
      Min Xu authored
      
      
      * [feat] MEVO kernel
      
      - initial import from min/softmax and min/testing branches
      - need to rename and further cleanup
      
      * only test with newer pytorch
      
      * renamed and added comments and code cleanup
      
      * rename and reduce test memory
      
      * testing
      
      * minor fixing
      
      * fixing
      
      * more fix
      
      * changelog
      
      * more 1.7 and 1.8 paper cuts
      
      * remove dead code
      
      * addressed Benjamin's comments
      
      * addressed more comments
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
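
      MEVO's memory win comes from never materializing the full [tokens, vocab] logits tensor: the (often weight-tied) vocab projection and the softmax/cross-entropy are fused and processed in vocab tiles. The sketch below is a plain-PyTorch illustration of that forward-pass tiling idea only, not the fairscale kernel or its API; the helper name and tile size are made up, and the real layer also recomputes tiles in the backward pass to keep the savings end to end.

      ```python
      import torch

      def tiled_vocab_loss(hidden, proj_weight, targets, tile=4096):
          """Cross-entropy over a large vocab without building the full
          [tokens, vocab] logits tensor: walk the vocab in tiles, keeping only
          a running logsumexp plus the logit of each token's target class."""
          n, vocab = hidden.shape[0], proj_weight.shape[0]
          lse = torch.full((n,), float("-inf"), device=hidden.device)
          tgt_logit = torch.zeros(n, device=hidden.device)
          for start in range(0, vocab, tile):
              w = proj_weight[start:start + tile]      # [tile, d_model]
              logits = hidden @ w.t()                  # [tokens, tile] only
              lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))
              in_tile = (targets >= start) & (targets < start + w.shape[0])
              local = (targets - start).clamp(0, w.shape[0] - 1)
              picked = logits.gather(1, local.unsqueeze(1)).squeeze(1)
              tgt_logit = tgt_logit + torch.where(in_tile, picked, torch.zeros_like(picked))
          # Per-token NLL; the real kernel also recomputes tiles in backward
          # so the memory savings hold end to end.
          return (lse - tgt_logit).mean()
      ```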
  6. 01 Nov, 2021 1 commit
    • [feat] [FSDP]: add experimental support to shared weights (#836) · f2af4c66
      Min Xu authored
      
      
      * added a new test, passing without shared weights
      
      * tested weight sharing
      
      * added the test to test list file
      
      * extended to world_size = 2
      
      * fixed test
      
      * [feat]: add limited and experimental support for shared parameter
      
      * fixed tests
      
      * simplify to work with layers that have at least one non-shared param, and add code to pick up the linked_param field for sharding the shared param
      
      * fixed the case where linked param is not in separate FSDP
      
      * changelog and remove old code
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
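
      For context, the weight sharing being targeted is the usual tied embedding / output projection case. The toy model below only illustrates that setup; FullyShardedDataParallel is the real fairscale class, but the exact wrapping pattern the experimental support handles is defined by the unit tests added in this PR, not by this sketch.

      ```python
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      class TiedLM(nn.Module):
          """Toy LM whose output projection shares (ties) the embedding weight."""
          def __init__(self, vocab=1000, d_model=64):
              super().__init__()
              self.embed = nn.Embedding(vocab, d_model)
              self.trunk = nn.Linear(d_model, d_model)
              self.out = nn.Linear(d_model, vocab, bias=False)
              self.out.weight = self.embed.weight  # the shared parameter

          def forward(self, tokens):
              return self.out(self.trunk(self.embed(tokens)))

      # Experimental/limited support: wrap so a single FSDP instance owns the
      # shared weight (assumes torch.distributed is already initialized).
      model = FSDP(TiedLM())
      ```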
  7. 27 Oct, 2021 1 commit
  8. 20 Oct, 2021 1 commit
  9. 20 Sep, 2021 1 commit
  10. 13 Sep, 2021 1 commit
  11. 12 Sep, 2021 1 commit
  12. 05 Sep, 2021 1 commit
  13. 12 Aug, 2021 2 commits
  14. 01 Aug, 2021 1 commit
  15. 31 Jul, 2021 1 commit
  16. 27 Jul, 2021 2 commits
  17. 26 Jul, 2021 1 commit
    • [feat]: prepare FSDP to handle multiple flatten params and fixed metadata saving for MoE (#746) · 83b0b49e
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 3: make FSDP use FlattenParamModule unconditionally
      
      * fixing the auto_wrap tests
      
      * minor
      
      * rewrite local_metadata_dict
      
      - updated FPW so that custom flat param name is also supported
      
      * bug fix
      
      * mypy
      
      * rewrote consolidate_shard_weights
      
      - test_consolidate passes
      
      * comments
      
      * fixing pickling
      
      * Fix shared params and MoE logic (#749)
      
      * add strict kwarg to support fairseq:gshard MoE saving logic
      
      * Test fairseq style shard
      
      * style
      
      * formatting and address comments
      
      * added changelog
      
      * fixing a test after padding renaming
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
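
      A hedged sketch of the offline consolidation path this PR reworks, assuming each rank previously saved the outputs of local_state_dict() and local_metadata_dict() (file names here are illustrative, and the consolidate_shard_weights signature may differ slightly across fairscale versions):

      ```python
      import torch
      from fairscale.nn import FullyShardedDataParallel as FSDP

      world_size = 4  # number of shards saved during training
      shard_weights = [torch.load(f"shard{rank}_weights.pt") for rank in range(world_size)]
      shard_metadata = [torch.load(f"shard{rank}_meta.pt") for rank in range(world_size)]

      # Rebuild a full, unflattened state dict offline -- no model instance and
      # no torch.distributed initialization needed.
      full_state_dict = FSDP.consolidate_shard_weights(shard_weights, shard_metadata)
      torch.save(full_state_dict, "consolidated.pt")
      ```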
  18. 12 Jul, 2021 1 commit
  19. 21 Jun, 2021 1 commit
    • [feat] FSDP: supporting multiple flatten parameter groups (#711) · ab71efb3
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 2: extending FPW to support multiple flat param groups
      - FSDP still only uses one group
      - unit tests exercise the new code paths
      - updated the changelog
      
      * first cut, mypy passed
      
      * test_flatten_params_wrapper.py::TestFlattenParams tests pass
      
      * added two more test cases and fixed a case in the code
      
      * fixed one bug with param_path_infos
      
      * fixed two more tests with hardcoded flat_param names
      
      * Update CHANGELOG.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
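
      A rough sketch of what multiple flat param groups look like at the FlattenParamsWrapper level. The param_list keyword and the import path are assumptions based on the commit message and the PR's tests; per the message, FSDP itself still uses a single group at this step.

      ```python
      import torch.nn as nn
      from fairscale.nn.misc import FlattenParamsWrapper

      module = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.LayerNorm(8))

      # Flatten into two groups instead of one: the Linear weights/biases
      # together, and the LayerNorm affine parameters separately.
      linear_params = list(module[0].parameters()) + list(module[1].parameters())
      norm_params = list(module[2].parameters())
      wrapped = FlattenParamsWrapper(module, param_list=[linear_params, norm_params])

      # Each group becomes its own flat parameter.
      print([p.numel() for p in wrapped.parameters()])
      ```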
  20. 11 Jun, 2021 1 commit
    • Use original forward pass directly when in eval mode from within checkpoint wrapper (#709) · 370b8483
      Pete authored
      * add failing test
      
      * add fix
      
      * use 'torch.is_grad_enabled()' instead of 'module.training'
      
      * Revert "add failing test"
      
      This reverts commit 1c34242208f9b2c5fa6c8f181434c2be6d7cdbc0.
      
      * add simple test
      
      * improve test
      
      * add check for fwd_counter
      
      * revert typing/format changes
      
      * move to new test file
      
      * CHANGELOG
      
      * remove old test
      
      * fix import order
      
      * fix test to be compat with torch 1.6.0
      
      * clean up
      
      * comments
      
      * isort 🤦
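
      In other words, a checkpoint-wrapped module now skips the re-computation machinery whenever gradients are disabled (the check is torch.is_grad_enabled(), per the commit). A small usage sketch:

      ```python
      import torch
      import torch.nn as nn
      from fairscale.nn.checkpoint import checkpoint_wrapper

      model = nn.Sequential(
          checkpoint_wrapper(nn.Linear(32, 32)),
          nn.ReLU(),
          checkpoint_wrapper(nn.Linear(32, 32)),
      )
      x = torch.randn(4, 32, requires_grad=True)

      # Training: activations are recomputed in backward as usual.
      model(x).sum().backward()

      # Inference: with grads disabled the wrapper calls the original forward
      # directly, so there is no checkpointing overhead.
      model.eval()
      with torch.no_grad():
          y = model(x)
      ```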
  21. 01 Jun, 2021 1 commit
  22. 28 May, 2021 1 commit
  23. 18 May, 2021 1 commit
  24. 17 May, 2021 2 commits
    • [fix] auto_wrap: support wrapping based on wrapper_config (#685) · 9d2bbcf2
      Min Xu authored
      
      
      * [fix] auto_wrap: support wrapping based on wrapper_config
      
      - users can use this to avoid the assert that fires if auto_wrap is used multiple times on a module
      - users can traverse the modules multiple times, assign a wrapper_config
        to each module, and then call auto_wrap once to wrap them
      
      fix #649
      fix #585
      
      * added changelog
      
      * fix tests
      
      * fix a test
      
      * added an optional assert for collision based on discussions with Quentin
      
      * added config_auto_wrap_policy
      
      * lint
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
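
      A hedged sketch of the workflow the commit describes: tag modules with a wrapper_config attribute in as many traversal passes as you like, then wrap everything in one auto_wrap call using config_auto_wrap_policy. The attribute and policy names come from the commit message; the import path and call signature are assumptions and may vary by fairscale version.

      ```python
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP
      from fairscale.nn.wrap import auto_wrap, config_auto_wrap_policy, enable_wrap

      model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 4))

      # Traverse as many times as needed, tagging only the modules to be wrapped.
      for m in model:
          if isinstance(m, nn.Linear) and m.out_features == 16:
              m.wrapper_config = {"flatten_parameters": True}

      # One auto_wrap pass wraps every tagged module, avoiding the assert that
      # fires when auto_wrap is applied twice (assumes torch.distributed is
      # initialized).
      with enable_wrap(wrapper_cls=FSDP):
          model = auto_wrap(model, auto_wrap_policy=config_auto_wrap_policy)
      ```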
    • [feat] Save FSDP metadata for offline unflattening + Consolidate checkpoints (#683) · 81c20f72
      Quentin Duval authored
      
      
      * Save FSDP metadata for offline unflattening
      
      * Complete the meta-data saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint
      
      * Add a unit test to show how to use the function
      
      * Code review + improvement of the unit tests
      
      * Code review: extract clean_path
      
      * Make meta data and consolidation of checkpoint work for flatten_parameter=False
      
      * Add new unit test file in CI
      
      * Complete changelog and fix mypy issues
      
      * Add support for module buffers in the consolidation of sharded checkpoints
      
      * Better support for module buffers: save them in the meta data
      
      * Refactoring: use a data format for the metadata that is simpler to understand (move from object-of-arrays to array-of-objects format)
      
      * Renaming to make code clearer
      
      * Code review: in_temporary_directory rework and typo correction
      
      * Renaming
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
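
      The save side of this feature is per rank: each rank stores its still-sharded weights plus the metadata needed to unflatten them later, so consolidation can run offline (see the consolidation sketch under the 26 Jul, 2021 entry above). Method names follow the commit; file names are illustrative.

      ```python
      import torch
      from fairscale.nn import FullyShardedDataParallel as FSDP

      def save_sharded_checkpoint(fsdp_model: FSDP, rank: int) -> None:
          # This rank's still-flattened, still-sharded weights ...
          torch.save(fsdp_model.local_state_dict(), f"shard{rank}_weights.pt")
          # ... plus the metadata needed to unflatten/unshard them offline later.
          torch.save(fsdp_model.local_metadata_dict(), f"shard{rank}_meta.pt")
      ```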
  25. 14 May, 2021 1 commit
  26. 13 May, 2021 1 commit
    • [fix] add and use get_process_group_cached (#678) · bde4bac5
      Min Xu authored
      * [fix] add and use get_process_group_cached
      
      - This commit makes FSDP avoid making too many process groups by default
      - Extra process groups are bad for GPU memory and init time
      
      * add changelog
      
      * lint
      
      * note on speed
      
      * add better assert output
      
      test seems to be flaky:
      https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2957/workflows/383c9f9f-f1a5-461c-8c41-e2e28ece037b/jobs/26783/steps
      
      
      
      * update test reference memory values
      
      - With cached process groups, the memory reported by pytorch is reduced as
      well (due to the bucket buffer memory used for reduction)
      - The effect is actually larger on the SMI memory, which is not reported by
      pytorch but is checked by this test.
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      
      * Update CHANGELOG.md
      
      * Update fairscale/utils/parallel.py
      
      * Update fairscale/utils/parallel.py
      
      * Update fairscale/utils/parallel.py
      
      * Update fairscale/utils/parallel.py
      
      * improved changelog
      
      * better handling of underscores in the md file
      Co-authored-by: Min Xu <min.xu@acm.org>
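
      A hedged sketch of how the cached group is meant to be used: nested FSDP instances share one ProcessGroup instead of each creating their own, which is what costs GPU memory and init time. The function name and module come from the commit; the no-argument call is an assumption.

      ```python
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP
      from fairscale.utils.parallel import get_process_group_cached

      # Returns the same ProcessGroup for the same set of ranks instead of
      # calling dist.new_group() every time (assumes torch.distributed is
      # initialized).
      pg = get_process_group_cached()

      inner = nn.Sequential(*[FSDP(nn.Linear(64, 64), process_group=pg) for _ in range(8)])
      model = FSDP(inner, process_group=pg)
      ```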
  27. 12 May, 2021 1 commit
    • [chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2
      anj-s authored
      * rename files
      
      * add newly renamed file
      
      * rename and move checkpoint activations related files
      
      * add test files to ci list
      
      * fix lint errors
      
      * modify docs
      
      * add changelog
      
      * retain old path for now
      
      * fix lint errors
      
      * add another import test case
      
      * fix merge conflict
      
      * add missing test file
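
      After the move, the wrapper is importable from fairscale.nn.checkpoint, with the old location retained for now per the commit. A quick illustration (the old path mentioned in the comment is approximate):

      ```python
      import torch.nn as nn

      # New location after the rename:
      from fairscale.nn.checkpoint import checkpoint_wrapper

      # The pre-rename path (under fairscale.nn.misc) is retained "for now" per
      # the commit, but new code should import from fairscale.nn.checkpoint.
      block = checkpoint_wrapper(nn.Sequential(nn.Linear(128, 128), nn.ReLU()))
      ```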
  28. 11 May, 2021 1 commit
    • [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      
      
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and ensure we detect
        issues with overlapping and cpu/gpu blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: Min Xu <min.xu@acm.org>
      Co-authored-by: Myle Ott <myleott@fb.com>
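
      For background, the overlap being restored is the standard pattern of issuing the next all-gather on a side CUDA stream while compute runs on the default stream, and synchronizing with stream waits rather than CPU-blocking calls. The sketch below is a generic plain-PyTorch illustration of that pattern, not fairscale's internal code.

      ```python
      import torch
      import torch.distributed as dist

      gather_stream = torch.cuda.Stream()

      def prefetch_all_gather(output_shards, local_shard):
          # Launch the next layer's all-gather on a side stream so it overlaps
          # with the current layer's compute on the default stream.
          with torch.cuda.stream(gather_stream):
              return dist.all_gather(output_shards, local_shard, async_op=True)

      def wait_before_use(handle):
          # Synchronize with stream-level waits; a CPU-blocking wait here is
          # exactly the kind of call that destroys the overlap.
          handle.wait()
          torch.cuda.current_stream().wait_stream(gather_stream)
      ```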
  29. 07 May, 2021 1 commit
  30. 05 May, 2021 2 commits
    • [fix] better assert and better test for frozen weights (#657) · b54eed1b
      Min Xu authored
      
      
      * [fix] better assert and better test for frozen weights
      
      - the precise condition should have been to check m.parameters(), not
        m.params.
      - fixes #643
      
      * add changelog
      
      * using an enum is so much better
      Co-authored-by: Min Xu <min.xu@acm.org>
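
      A hedged reading of the fix: the assert checks requires_grad over m.parameters() of a wrapped instance, so frozen and trainable parts are best wrapped separately to keep each instance uniform. The sketch below shows that wrapping pattern; it is illustrative, and the exact condition FSDP enforces is the one in the PR, not this paraphrase.

      ```python
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      trunk = nn.Linear(32, 32)   # frozen feature extractor
      head = nn.Linear(32, 10)    # trainable head
      for p in trunk.parameters():
          p.requires_grad = False

      # Wrap frozen and trainable parts in separate FSDP instances so each
      # instance sees uniform requires_grad (assumes torch.distributed is
      # initialized).
      model = FSDP(nn.Sequential(FSDP(trunk), FSDP(head)))
      ```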
    • [fix] add clear_autocast_cache flag (#650) · 861b5ce2
      Min Xu authored
      
      
      * [fix] add clear_autocast_cache flag
      
      - when training in AMP mode with fp32 weights, FSDP may need to
        optionally clear the autocast cache to avoid GPU OOM
      - this flag defaults to False; doing it automatically is a future TODO
      - also added a verbose flag to make print(fsdp_model) a bit shorter
      - updated the memory test to cover the new code
      - added a couple of useful functions in parallel.py and testing.py
      
      * minor
      
      * address comments
      
      * format
      
      * improve the test
      Co-authored-by: Min Xu <min.xu@acm.org>
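
      A hedged usage sketch of the flags added here. The names clear_autocast_cache and verbose come from the commit; their defaults and exact behavior are as described there, and everything else below (module, data, distributed setup) is illustrative.

      ```python
      import torch
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      # Assumes torch.distributed is initialized and a GPU is available.
      module = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10)).cuda()

      # clear_autocast_cache is opt-in (defaults to False) and clears the
      # autocast weight cache to avoid GPU OOM when keeping fp32 weights under
      # AMP; verbose controls how much detail print(model) shows.
      model = FSDP(module, clear_autocast_cache=True, verbose=False)
      opt = torch.optim.SGD(model.parameters(), lr=0.1)

      x = torch.randn(8, 128, device="cuda")
      y = torch.randint(0, 10, (8,), device="cuda")
      with torch.cuda.amp.autocast():
          loss = nn.functional.cross_entropy(model(x), y)
      loss.backward()
      opt.step()
      ```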
  31. 03 May, 2021 1 commit
  32. 28 Apr, 2021 2 commits
    • msbaines · 2bb2a134
    • [feat] save memory by using bucket buffer only in backward (#633) · a5594032
      Min Xu authored
      
      
      * [feat] save memory by using bucket buffer only in backward
      
      - this fixes bug #627
      - added documentation to clarify the buffer's cost and speed/memory
        tradeoff
      - added setup/teardown calls so that the buffer is only allocated
        during the backward pass, saving memory during forward and stepping
        so it can be used for things like activations.
      - added a unit test that asserts the memory is in range.
      
      Comparing with DDP:
      
        1. buffer size scales with the number of FSDP instances, not model size
        2. buffer is only allocated during backward
        3. buffer is used for small tensors only to reduce overhead
        4. overlapping of compute-reduction is very different
      
      * add PR number to changelog
      
      * filled in with memory number on 1.9
      
      * addressed comments
      
      * update comments
      
      * fix for 1.6
      
      * add a todo
      Co-authored-by: Min Xu <min.xu@acm.org>
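
      The knob involved is bucket_cap_mb, which sizes the small-tensor reduction buffer that, after this change, lives only for the duration of the backward pass. A hedged sketch (the values and the nested wrapping are illustrative; buffer size scales with the number of FSDP instances, per the commit):

      ```python
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      # The buffer is used only for small tensors to cut reduction overhead
      # (assumes torch.distributed is initialized).
      model = FSDP(
          nn.Sequential(*[FSDP(nn.Linear(256, 256)) for _ in range(4)]),
          bucket_cap_mb=25,  # a value <= 0 typically disables the bucket buffer
      )
      ```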