- 20 Oct, 2021 1 commit
anj-s authored
* add log for new memory tracker features
-
- 20 Sep, 2021 1 commit
tmarkstrum authored
* [chore] 0.4.1 release
* put more details in one change log
-
- 13 Sep, 2021 1 commit
Benjamin Lefaudeux authored
-
- 12 Sep, 2021 1 commit
Min Xu authored
* add changelog for previous commit
* fix a merge-induced error
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 05 Sep, 2021 1 commit
Min Xu authored
* [bug] [FSDP] make sure we use full params for multiple backward passes within an iteration
* changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 12 Aug, 2021 2 commits
anj-s authored
-
Min Xu authored
* minor: changelog and pre-commit
* addressed comment
* update the release doc
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 01 Aug, 2021 1 commit
Min Xu authored
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 31 Jul, 2021 1 commit
Myle Ott authored
* Add test (broken) for gradient accumulation without the no_sync context manager
* changelog
* no_sync to grad_acc renaming for tests
* clean up tmp files
* support grad acc without no_sync
* minor
* update changelog
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py (better assertion from Sam)
* lint
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
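In practical terms, gradient accumulation now works whether or not the intermediate micro-batches run under no_sync(). A minimal sketch, assuming a single-rank process group purely for illustration (real use is multi-process, typically with NCCL; no_sync() remains the cheaper path because it skips the per-micro-batch gradient reduction):

```python
import os
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Single-rank process group, purely for illustration.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo", rank=0, world_size=1)
device = torch.device("cuda" if use_cuda else "cpu")

model = FSDP(torch.nn.Linear(8, 8).to(device))
optim = torch.optim.SGD(model.parameters(), lr=0.01)

# Micro-batch 1: accumulate while skipping the gradient reduction (the pre-existing path).
with model.no_sync():
    model(torch.rand(4, 8, device=device)).sum().backward()

# Micro-batch 2: a plain backward also accumulates correctly (what this change adds),
# at the cost of an extra gradient reduction per micro-batch.
model(torch.rand(4, 8, device=device)).sum().backward()

optim.step()
optim.zero_grad()
```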
-
- 27 Jul, 2021 2 commits
Min Xu authored
* [chore] 0.3.9 release
* update changelog
* address comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
Benjamin Lefaudeux authored
-
- 26 Jul, 2021 1 commit
Min Xu authored
* [feat] FSDP: supporting multiple flatten parameter groups - step 3: make FSDP use FlattenParamsWrapper unconditionally
* fixing the auto_wrap tests
* minor
* rewrite local_metadata_dict - updated FPW so that custom flat param names are also supported
* bug fix
* mypy
* rewrote consolidate_shard_weights - test_consolidate passes
* comments
* fixing pickling
* Fix shared params and MoE logic (#749)
* add strict kwarg to support fairseq:gshard MoE saving logic
* Test fairseq-style shard
* style
* formatting and address comments
* added changelog
* fixing a test after padding renaming
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
-
- 12 Jul, 2021 1 commit
anj-s authored
-
- 21 Jun, 2021 1 commit
Min Xu authored
* [feat] FSDP: supporting multiple flatten parameter groups - step 2: extend FPW to support multiple flat param groups; FSDP still uses only one group; unit tests exercise the new code paths; updated the changelog
* first cut, mypy passed
* test_flatten_params_wrapper.py::TestFlattenParams tests pass
* added two more test cases and fixed a case in the code
* fixed one bug with param_path_infos
* fixed two more tests with hardcoded flat_param names
* Update CHANGELOG.md
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 11 Jun, 2021 1 commit
Pete authored
* add failing test
* add fix
* use 'torch.is_grad_enabled()' instead of 'module.training'
* Revert "add failing test" (reverts commit 1c34242208f9b2c5fa6c8f181434c2be6d7cdbc0)
* add simple test
* improve test
* add check for fwd_counter
* revert typing/format changes
* move to new test file
* CHANGELOG
* remove old test
* fix import order
* fix test to be compatible with torch 1.6.0
* clean up
* comments
* isort 🤦
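The practical effect: an activation-checkpointed module now follows torch.is_grad_enabled() rather than module.training, so forwards under no_grad() skip the checkpointing machinery. A minimal sketch, assuming the checkpoint_wrapper export from fairscale.nn (paths moved around in the May 2021 rename):

```python
import torch
from fairscale.nn import checkpoint_wrapper

block = checkpoint_wrapper(
    torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
)
x = torch.rand(2, 16, requires_grad=True)

# Grad enabled: activations are recomputed during backward as usual.
block(x).sum().backward()

# Grad disabled: with this fix the wrapper takes the plain forward path,
# even though block.training is still True.
with torch.no_grad():
    _ = block(x)
```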
-
- 01 Jun, 2021 1 commit
Pete authored
* add failing test for buffer dtype
* fix buffer dtype issue
* update CHANGELOG
* fix
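For context, buffer precision is controlled separately from parameter precision via FSDP's buffer_dtype argument; a hedged sketch (the wrapped module is a placeholder, and an initialized process group is assumed):

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# `net` stands in for any nn.Module; a process group must already be initialized.
net = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.BatchNorm1d(32)).cuda()

model = FSDP(
    net,
    mixed_precision=True,        # fp16 parameters/gradients for compute and communication
    buffer_dtype=torch.float32,  # keep module buffers (e.g. BatchNorm statistics) in fp32
)
```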
-
- 28 May, 2021 1 commit
Min Xu authored
* [do not merge] testing a corner case
* workaround
* using a dummy tensor to fix
* lint
* changelog
* update a comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 18 May, 2021 1 commit
Min Xu authored
* [chore] 0.3.7 release
* fixed changelog
Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 17 May, 2021 2 commits
Min Xu authored
* [fix] auto_wrap: support wrapping based on wrapper_config - users can use this to avoid the assert when auto_wrap is used multiple times on a module; users can traverse the modules multiple times, assign a wrapper_config to each module, and then call auto_wrap once to wrap them (fixes #649, fixes #585)
* added changelog
* fix tests
* fix a test
* added an optional assert for collisions, based on discussions with Quentin
* added config_auto_wrap_policy
* lint
Co-authored-by: Min Xu <min.xu.public@gmail.com>
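A rough sketch of the intended flow, treating the wrapper_config attribute as a dict of FSDP kwargs and using the enable_wrap/auto_wrap helpers from fairscale.nn.wrap; exact argument placement has shifted between fairscale versions, so this is illustrative rather than authoritative:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, config_auto_wrap_policy, enable_wrap

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Linear(32, 8))

# Traverse the model as many times as needed, tagging the modules to wrap.
for layer in model:
    layer.wrapper_config = {"mixed_precision": True}  # assumed: kwargs forwarded to FSDP

# One auto_wrap pass: only modules carrying wrapper_config get wrapped.
with enable_wrap(wrapper_cls=FSDP):
    model = auto_wrap(model, auto_wrap_policy=config_auto_wrap_policy)
```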
-
Quentin Duval authored
* Save FSDP metadata for offline unflattening
* Complete the metadata-saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint
* Add a unit test to show how to use the function
* Code review + improvement of the unit tests
* Code review: extract clean_path
* Make metadata saving and consolidation of checkpoints work for flatten_parameters=False
* Add new unit test file in CI
* Complete changelog and fix mypy issues
* Add support for module buffers in the consolidation of sharded checkpoints
* Better support for module buffers: save them in the metadata
* Refactoring: use a data format for the metadata that is simpler to understand (move from an object-of-arrays to an array-of-objects format)
* Renaming to make the code clearer
* Code review: in_temporary_directory rework and typo correction
* Renaming
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
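A hedged sketch of the resulting workflow, using the method names that appear in this history (local_state_dict, local_metadata_dict, consolidate_shard_weights); fsdp_model, rank, and world_size are placeholders, and exact signatures may differ by fairscale version:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# On every rank: save the sharded weights plus the metadata needed to
# unflatten and reassemble them offline.
torch.save(
    {"weights": fsdp_model.local_state_dict(), "meta": fsdp_model.local_metadata_dict()},
    f"shard_{rank}.pt",
)

# Later, on a single host: rebuild a consolidated (unsharded) state dict.
shards = [torch.load(f"shard_{r}.pt") for r in range(world_size)]
full_state_dict = FSDP.consolidate_shard_weights(
    shard_weights=[s["weights"] for s in shards],
    shard_metadata=[s["meta"] for s in shards],
)
```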
-
- 14 May, 2021 1 commit
anj-s authored
* api changes
* fix list
* modify changelog
* move function
-
- 13 May, 2021 1 commit
Min Xu authored
* [fix] add and use get_process_group_cached - this makes FSDP avoid creating too many process groups by default; extra process groups are bad for GPU memory and init time
* add changelog
* lint
* note on speed
* add better assert output (the test seems to be flaky: https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2957/workflows/383c9f9f-f1a5-461c-8c41-e2e28ece037b/jobs/26783/steps)
* update test reference memory values - with cached process groups, the memory reported by pytorch is also reduced (due to the bucket buffer memory for the reduction buffer); the bigger effect is on the SMI memory, which is not reported by pytorch and is what this test checks
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
* Update CHANGELOG.md
* Update fairscale/utils/parallel.py
* improved changelog
* better handling of underscores in the md file
Co-authored-by: Min Xu <min.xu@acm.org>
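A hedged sketch of how callers can share the cached group rather than creating one communicator per wrapper; the no-argument call to get_process_group_cached and its location in fairscale.utils.parallel are assumptions based on this commit:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.utils.parallel import get_process_group_cached

# Repeated calls return the same ProcessGroup instead of allocating a fresh
# NCCL communicator (and its GPU-side buffers) for every FSDP instance.
pg = get_process_group_cached()

inner = FSDP(torch.nn.Linear(64, 64).cuda(), process_group=pg)
outer = FSDP(torch.nn.Sequential(inner, torch.nn.Linear(64, 8).cuda()), process_group=pg)
```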
-
- 12 May, 2021 1 commit
anj-s authored
* rename files
* add newly renamed file
* rename and move checkpoint activations related files
* add test files to ci list
* fix lint errors
* modify docs
* add changelog
* retain old path for now
* add another import test case
* fix merge conflict
* add missing test file
-
- 11 May, 2021 1 commit
Min Xu authored
* [fix] FSDP forward pass overlap between compute and all-gather - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it; a new unit test is added to check for this and to ensure we detect issues with overlapping and with cpu/gpu blocking wait calls
* fix
* better assertion outputs
* fix format and tune all_gather mb for CI
* more tuning with non_flatten
* undo an accidental change
* tuning all-gather mb and del model
* Update + fix overlapping test to use patched all_gather w/ delay (#672)
* fixing get_cycles_per_ms
* add get_smi_memory
* update the docstring
Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Myle Ott <myleott@fb.com>
-
- 07 May, 2021 1 commit
Min Xu authored
* [test] add a more general test case - also rebalance the tests a bit
* added missing arg
* balance
* better checking
* make test smaller and faster
* make ddp results cached and enable sync_bn
* clean up
* fix tests
* changelog
* fix
* addressing comments
Co-authored-by: Min Xu <min.xu@acm.org>
-
- 05 May, 2021 2 commits
Min Xu authored
* [fix] better assert and better test for frozen weights - the precise condition should check m.parameters(), not m.params (fixes #643)
* add changelog
* using an enum is so much better
Co-authored-by: Min Xu <min.xu@acm.org>
-
Min Xu authored
* [fix] add clear_autocast_cache flag - when training in AMP mode with fp32 weights, FSDP may need to optionally clear the autocast cache to avoid GPU OOM; the flag defaults to false, and clearing automatically is a future TODO; also added a verbose flag to make print(fsdp_model) a bit shorter; updated the memory test to cover the new code; added a couple of useful functions in parallel.py and testing.py
* minor
* address comments
* format
* improve the test
Co-authored-by: Min Xu <min.xu@acm.org>
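A minimal sketch of the two flags added here (what verbose prints and when the cache is cleared are assumptions beyond the description above):

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes an initialized process group and a CUDA device.
model = FSDP(
    torch.nn.Linear(1024, 1024).cuda(),
    mixed_precision=True,
    clear_autocast_cache=True,  # let FSDP clear torch.cuda.amp's autocast cache to avoid GPU OOM (off by default)
    verbose=True,               # controls how much detail print(fsdp_model) shows
)
print(model)
```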
-
- 03 May, 2021 1 commit
Benjamin Lefaudeux authored
* fix + unit test
* changelog update
-
- 28 Apr, 2021 2 commits
msbaines authored
-
Min Xu authored
* [feat] save memory by using the bucket buffer only in backward - this fixes bug #627; added documentation to clarify the buffer's cost and the speed/memory tradeoff; added setup/teardown calls so that the buffer is only allocated during the backward pass, freeing more memory for the forward pass and the optimizer step (e.g. for activations); added a unit test that asserts the memory is in range. Compared with DDP: 1. the buffer size scales with the number of FSDP instances, not with model size; 2. the buffer is only allocated during backward; 3. the buffer is used for small tensors only, to reduce overhead; 4. the overlap of compute and reduction is very different
* add PR number to changelog
* filled in the memory numbers on 1.9
* addressed comments
* update comments
* fix for 1.6
* add a todo
Co-authored-by: Min Xu <min.xu@acm.org>
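For reference, the reduction bucket described above is sized by FSDP's bucket_cap_mb argument; a hedged sketch (the zero-disables convention is an assumption):

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes an initialized process group and a CUDA device.
model = FSDP(
    torch.nn.Linear(1024, 1024).cuda(),
    bucket_cap_mb=25,  # small gradients are coalesced into a ~25 MB bucket for reduction
)
# bucket_cap_mb=0 is assumed to disable bucketing entirely, trading reduction
# overhead for getting the buffer memory back.
```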
-
- 26 Apr, 2021 1 commit
Min Xu authored
* [chore] 0.3.6 release
* try redo the caches
Co-authored-by: Min Xu <min.xu@acm.org>
-
- 19 Apr, 2021 1 commit
Min Xu authored
* [chore] 0.3.5 release
* address comment
Co-authored-by: Min Xu <min.xu@acm.org>
-
- 13 Apr, 2021 1 commit
Benjamin Lefaudeux authored
-
- 02 Apr, 2021 1 commit
Min Xu authored
* releasing 0.3.3 - I need it in vissl for the auto_wrap_bn change
-
- 18 Mar, 2021 3 commits
Min Xu authored
-
Min Xu authored
* [feat] FSDP: add auto_wrap_bn - add a utility function to handle wrapping of BN layers
* changelog
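A hedged sketch of the intended usage: wrap the BatchNorm layers first, then shard the rest. The import path and what the helper configures internally (e.g. full precision, unflattened params) are assumptions and have moved between fairscale versions:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap_bn  # import path may differ across versions

# Assumes an initialized process group and a CUDA device.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).cuda()

# Put the BN layers into their own wrappers first, then shard the remaining parameters.
model = auto_wrap_bn(model)
model = FSDP(model, mixed_precision=True)
```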
-
Min Xu authored
* [feature] FSDP: enable pytorch SyncBN - not fully validated yet, but at least it no longer asserts; this enables VISSL to move forward with its next PR
* add the test file
* changelog and lint
* addressed comment
-
- 12 Mar, 2021 1 commit
Min Xu authored
* FSDP: multi-pass autograd graph and mixed precision - added BACKWARD_PRE/POST checking; better assert_state; fixed an issue with the backward hook misfiring
* fix
* cleanup
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Myle Ott <myleott@fb.com>
-
- 11 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* Adding a hard sync barrier before the broadcast - mostly useful for Gloo; NCCL is synced behind the scenes
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510
-
- 09 Mar, 2021 1 commit
Min Xu authored
* [chore] 0.3.1 release - mainly because vissl needs the new version; added a doc on release steps
* Update CHANGELOG.md
* review comments
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>
-