Commits · 72c6bab24f398dbc583a26508dd9ee1f3dbc4fc2 · OpenDAS / fairscale

12 May, 2021 1 commit

[chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2

anj-s authored May 12, 2021

* rename files

* add newly renamed file

* rename and move checkpoint activations related files

* add test files to ci list

* fix lint errors

* modify docs

* add changelog

* retain old path for now

* fix lint errors

* add another import test case

* fix merge conflict

* add missing test file

72c6bab2
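
The "retain old path for now" step in #654 above is a common way to move a module without breaking existing imports. Below is a minimal sketch of that pattern, not fairscale's actual code: the package name `demo_pkg`, the module names, and the placeholder function `checkpoint_wrapper` are all assumptions made for illustration, and the modules are built in memory so the example is self-contained.

```python
import sys
import types

# Hypothetical parent package, created in memory for illustration only.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []  # mark it as a package
sys.modules["demo_pkg"] = pkg

# New location of the code after the move (hypothetical name).
new_mod = types.ModuleType("demo_pkg.checkpoint")


def checkpoint_wrapper(fn):
    """Placeholder for the relocated function; returns fn unchanged."""
    return fn


new_mod.checkpoint_wrapper = checkpoint_wrapper
sys.modules["demo_pkg.checkpoint"] = new_mod

# Old location kept as a thin shim that re-exports from the new one.
old_mod = types.ModuleType("demo_pkg.misc")
old_mod.checkpoint_wrapper = new_mod.checkpoint_wrapper
sys.modules["demo_pkg.misc"] = old_mod

# An import test in the spirit of "add another import test case":
# both paths must resolve to the very same object.
from demo_pkg.misc import checkpoint_wrapper as old_name
from demo_pkg.checkpoint import checkpoint_wrapper as new_name

same_object = old_name is new_name
```

Re-exporting the same object, rather than copying the code, means there is only one implementation to maintain until the old path is removed.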

11 May, 2021 1 commit

[fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3

Min Xu authored May 10, 2021



* [fix] FSDP forward pass overlap between compute and all-gather

- many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
- a new unit test checks for this and ensures we detect
  issues with overlapping and CPU/GPU blocking wait calls

* fix

* fix

* fix

* better assertion outputs

* fix format and tune all_gather mb for CI

* more tuning with non_flatten

* undo an accidental change

* tuning all gather mb and del model

* Update + fix overlapping test to use patched all_gather w/ delay (#672)

* fixing get_cycles_per_ms

* add get_smi_memory

* update the docstring
Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Myle Ott <myleott@fb.com>

8a42a8e3

07 May, 2021 1 commit

[fix]: support pytorch SyncBatchNorm under AMP & checkpointing with FSDP (#659) · 6db68518

Min Xu authored May 07, 2021



* [test]: add a more general test case

- also rebalance the tests a bit

* added missing arg

* balance

* better checking

* balance

* make test smaller and faster

* make ddp results cached and enable sync_bn

* clean up

* fix tests

* changelog

* balance

* fix

* addressing comments
Co-authored-by: Min Xu <min.xu@acm.org>

6db68518

05 May, 2021 2 commits

[fix] better assert and better test for frozen weights (#657) · b54eed1b

Min Xu authored May 05, 2021



* [fix] better assert and better test for frozen weights

- the precise condition should have checked m.parameters(), not
  m.params.
- fixes #643

* add changelog

* using an enum is so much better
Co-authored-by: Min Xu <min.xu@acm.org>

b54eed1b

[fix] add clear_autocast_cache flag (#650) · 861b5ce2

Min Xu authored May 04, 2021



* [fix] add clear_autocast_cache flag

- when training in AMP mode with float32 weights, FSDP may need to
  optionally clear the autocast cache to avoid GPU OOM
- this flag defaults to false; clearing the cache automatically is a future TODO
- also added a verbose flag to make print(fsdp_model) a bit shorter
- updated the memory test to cover the new code
- added a couple of useful functions in parallel.py and testing.py

* minor

* address comments

* format

* improve the test
Co-authored-by: Min Xu <min.xu@acm.org>

861b5ce2

03 May, 2021 1 commit
- [fix] SDP: expose module property fix + unit test (#647) · 4e438ba1
  Benjamin Lefaudeux authored May 03, 2021
```
* fix + unit test
* changelog update
```
  4e438ba1
28 Apr, 2021 2 commits

[chore] do not build cuda extensions by default (#634) · 2bb2a134
msbaines authored Apr 27, 2021

2bb2a134

[feat] save memory by using bucket buffer only in backward (#633) · a5594032

Min Xu authored Apr 27, 2021



* [feat] save memory by using bucket buffer only in backward

- this fixes bug #627
- added documentation to clarify the buffer's cost and speed/memory
  tradeoff
- added setup/teardown calls so that the buffer is only allocated
  during the backward pass, saving more memory for forward and stepping
  so that they can be used for things like activations.
- added a unit test that asserts the memory is in range.

Comparing with DDP:

  1. buffer size scales with the number of FSDP instances, not model size
  2. buffer is only allocated during backward
  3. buffer is used for small tensors only to reduce overhead
  4. overlapping of compute-reduction is very different

* add PR number to changelog

* filled in with memory number on 1.9

* addressed comments

* update comments

* fix for 1.6

* add a todo
Co-authored-by: Min Xu <min.xu@acm.org>

a5594032
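
The setup/teardown idea described in #633 above, where the bucket buffer exists only for the duration of the backward pass, can be illustrated with a small stand-alone sketch. This is not fairscale's implementation: the class `BackwardOnlyBuffer`, the helper `run_step`, and the use of a `bytearray` in place of a GPU tensor are all assumptions made to keep the example runnable without any dependencies.

```python
from typing import List, Optional, Tuple


class BackwardOnlyBuffer:
    """Hypothetical helper: holds a scratch buffer only between setup() and teardown()."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self.data: Optional[bytearray] = None  # nothing is allocated up front

    def setup(self) -> None:
        # Allocate just before the backward pass begins.
        if self.data is None:
            self.data = bytearray(self.size_bytes)

    def teardown(self) -> None:
        # Drop the reference so the memory can be reused by forward and stepping.
        self.data = None

    @property
    def allocated(self) -> bool:
        return self.data is not None


def run_step(buf: BackwardOnlyBuffer) -> List[Tuple[str, bool]]:
    """Record whether the buffer is allocated in each phase of one training step."""
    trace = [("forward", buf.allocated)]
    buf.setup()
    trace.append(("backward", buf.allocated))
    buf.teardown()
    trace.append(("optimizer_step", buf.allocated))
    return trace


trace = run_step(BackwardOnlyBuffer(1024))
```

The trade-off is the one the commit message names: allocating and freeing the buffer on every step costs some time, in exchange for leaving more memory available for activations outside the backward pass.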

26 Apr, 2021 1 commit

[chore] 0.3.6 release (#631) · 36da9d6e

Min Xu authored Apr 26, 2021



* [chore] 0.3.6 release

* try redoing the caches
Co-authored-by: Min Xu <min.xu@acm.org>

36da9d6e

19 Apr, 2021 1 commit

[chore] 0.3.5 release (#616) · 1141528e

Min Xu authored Apr 19, 2021



* [chore] 0.3.5 release

* address comment
Co-authored-by: Min Xu <min.xu@acm.org>

1141528e

13 Apr, 2021 1 commit
- [chore] v0.3.4 (#603) · 82d6997c
  Benjamin Lefaudeux authored Apr 13, 2021
  
  82d6997c
02 Apr, 2021 1 commit
- [chore] 0.3.3 release (#568) · 60694da1
  Min Xu authored Apr 02, 2021
```
- releasing 0.3.3
- I need it in vissl for the auto_wrap_bn change
```
  60694da1
18 Mar, 2021 3 commits

[chore] 0.3.2 release (#535) · 9a37498c
Min Xu authored Mar 18, 2021

9a37498c

[feat] FSDP: add auto_wrap_bn (#531) · 8b59267b

Min Xu authored Mar 18, 2021

* [feat] FSDP: add auto_wrap_bn

- add a utility function to handle wrapping of BN

* changelog

8b59267b

[feature] FSDP: enable pytorch SyncBN (#527) · 2fc1f6d8

Min Xu authored Mar 17, 2021

* [feature] FSDP: enable pytorch SyncBN

- not fully validated yet but at least not asserting
- this enables VISSL to move forward with its next PR

* add the test file

* changelog and lint

* addressed comment

2fc1f6d8

12 Mar, 2021 1 commit

[fix] FSDP: multi-pass autograd graph and mixed precision (#513) · 82986ca0

Min Xu authored Mar 12, 2021



* FSDP: multi-pass autograd graph and mixed precision

- added BACKWARD_PRE/POST checking
- better assert_state
- fixed issue of backward hook misfiring

* fix

* cleanup

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Myle Ott <myleott@fb.com>

82986ca0

11 Mar, 2021 1 commit

[fix][OSS] Adding a hard sync stream barrier before broadcast (#512) · c9fdf506

Benjamin Lefaudeux authored Mar 11, 2021

* Adding a hard sync barrier before the broadcast; mostly useful for Gloo, since NCCL is synced behind the scenes
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510

c9fdf506

09 Mar, 2021 1 commit

[chore] 0.3.1 release (#504) · 84cec202

Min Xu authored Mar 09, 2021



* [chore] 0.3.1 release

- mainly because vissl needs the new version
- added a doc on release steps

* Update CHANGELOG.md
Co-authored-by: anj-s <32556631+anj-s@users.noreply.github.com>

* review comments

84cec202

25 Feb, 2021 1 commit
- [ShardedDDP][Minor] Backport a bucket flush fix from FSDP, may help a few existing users (#435) · 7ee228bf
  Benjamin Lefaudeux authored Feb 25, 2021
```
* bring back a fix from FSDP, may help a few existing users
```
  7ee228bf
23 Feb, 2021 6 commits
- [chore] v0.3.0 (#416) · d64ff250
  Benjamin Lefaudeux authored Feb 22, 2021
```
* v0.3.0 it is, celebration time
```
  d64ff250
- [perf][ShardedDDP] fp16 gradient reduce (#411) · d52d2186
  Benjamin Lefaudeux authored Feb 22, 2021
```
* POC, testing against the DDP comm hook when available
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
```
  d52d2186
- [docs] minor changelog update · 4f2eb1ad
  Min Xu authored Feb 22, 2021
  
  4f2eb1ad
- [doc] minor formatting of changelog · b6934bf5
  Min Xu authored Feb 22, 2021
  
  b6934bf5
- [bug]: not all CUDA memory is freed when model is deleted (#412) · e3035933
  Min Xu authored Feb 22, 2021
```
* [bug]: not all CUDA memory is freed when model is deleted

* fixed memory leak

- without this, peak memory will be high when more than one model
  is trained (i.e. the first model leaves stuff around, pushing up the
  peak memory when the second model runs)

* addressed comments

* fix

* changelog
```
  e3035933
- [docs] fsdp changelog and doc (#414) · 2b15720b
  Min Xu authored Feb 22, 2021
  
  2b15720b
22 Feb, 2021 1 commit
- [fix][OSS] adding an assert for empty shards + corresponding unit test (#406) · 279b8024
  Benjamin Lefaudeux authored Feb 22, 2021
```
* adding an assert + corresponding unit test
* updated changelog
* adjusting the adascale tests
```
  279b8024
19 Feb, 2021 1 commit
- [chore] v0.1.7 (#404) · a606e84b
  Benjamin Lefaudeux authored Feb 19, 2021
```
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
```
  a606e84b
18 Feb, 2021 1 commit
- [fix][minor] ShardedDDP train/eval modes (#393) · ef7146d5
  Benjamin Lefaudeux authored Feb 18, 2021
```
* [fix] ShardedDDP train/eval modes
* Update CHANGELOG.md
```
  ef7146d5
17 Feb, 2021 1 commit
- [feat][ShardedDDP] manual reduce option (#389) · 47042917
  Benjamin Lefaudeux authored Feb 16, 2021
```
* initial implementation, with unit test and assert
* added changelog and better debug string
```
  47042917
12 Feb, 2021 1 commit

[feature-fix-refactor][ShardedDDP] Make it possible to change trainability graph on the fly (#369) · 13445c55

Benjamin Lefaudeux authored Feb 11, 2021

* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests

13445c55

11 Feb, 2021 1 commit
- [chore] v0.1.6 (#377) · ce9e7e48
  Benjamin Lefaudeux authored Feb 10, 2021
```
* v0.1.6
```
  ce9e7e48
03 Feb, 2021 1 commit
- [chore] v0.1.5 (#355) · 4401ced9
  Benjamin Lefaudeux authored Feb 03, 2021
  
  4401ced9
02 Feb, 2021 1 commit

[feat][OSS] elastic and pytorch compatible checkpoints (#310) · 9e8929e6

Benjamin Lefaudeux authored Feb 02, 2021

* adding a test to prove the interoperability with upstream pytorch
* updating the changelog
* eager state pruning
* pytorch 1.5 compat

9e8929e6

29 Jan, 2021 1 commit
- [ShardedDDP] Bucketing reduce calls, tensor views (#327) · 51625eda
  Benjamin Lefaudeux authored Jan 28, 2021
  
  51625eda
07 Jan, 2021 1 commit
- [fix] Adding missing CUDA files in the pip package v0.1.4 (#295) · 53a912c3
  Benjamin Lefaudeux authored Jan 07, 2021
```
* trying to fix the missing files in the pip package (not in this diff)
* adding a long description, more pypi friendly
```
  53a912c3
05 Jan, 2021 1 commit
- [chore] creating 0.1.3 to align numbering everywhere (#289) · 7cc8b34a
  Benjamin Lefaudeux authored Jan 04, 2021
```
release pip package to follow suit
```
  7cc8b34a
04 Jan, 2021 2 commits

[chore] 0.1.2 version bump (#285) · a21f50f9
Benjamin Lefaudeux authored Jan 04, 2021

a21f50f9

[feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6

Min Xu authored Jan 04, 2021

* [feat] sync adascale from internal repo

- tbd

testing: tbd

* Update argument document of __init__

* update documentation around set_num_gradients_to_accumulate

* added checking code for proper API calling places

* rename internal APIs to make them internal

* updated changelog

* added support for add_param_group and its unit test

* added unit test for set_num_gradients_to_accumulate

* added debias_ewma unit test

* fixed test_set_num_gradients_to_accumulate (need zero_grad() call)

* added missing zero_grad() to test_lr_scheduler

* fixed test_add_param_group with respect to optim.zero_grad()

* added test_gradient_value

* added test_scale_not_equal_default for scale != world_size * grad_accum

* added test_unhook()

* removed print statements

* fixed a typo

* addressed Ben's comment

3932a1f6

30 Dec, 2020 1 commit

[fix] Dead code removal for OSS (#276) · fb8d9137

Benjamin Lefaudeux authored Dec 29, 2020

* removing a dead call since ShardedDDP, small speedup
* unrelated, but filling in the changelog
* another nit

fb8d9137

24 Dec, 2020 1 commit

[chore] Update changelog (#268) · 18455bf0

Min Xu authored Dec 23, 2020

* Update changelog

missed this item from previous AdaScale commit.

* More change log

* Addressed review comments

18455bf0