  1. 29 Apr, 2021 1 commit
  2. 28 Apr, 2021 2 commits
    • [test] improve BN test coverage (#638) · 21cba91b
      Min Xu authored
      
      
      * [test] improve BN test coverage
      
      - Added sync_bn on/off cases
      - Added conv and linear bias on/off cases
      - Clarified in the test when BN wrapping is needed while sync_bn is off (sketched below)
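
      A minimal sketch of the wrapping being tested, assuming torch.distributed is
      already initialized and a GPU is available; the helper below is illustrative
      (fairscale provides an auto_wrap_bn helper for this purpose), and the actual
      test models differ:

        import torch.nn as nn
        from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

        def wrap_bn_separately(module: nn.Module) -> nn.Module:
            # Wrap each BatchNorm child in its own full-precision, unflattened FSDP
            # instance so BN stays in fp32 even when the outer wrapper runs in
            # mixed precision.
            for name, child in module.named_children():
                if isinstance(child, nn.modules.batchnorm._BatchNorm):
                    setattr(module, name, FSDP(child, mixed_precision=False, flatten_parameters=False))
                else:
                    wrap_bn_separately(child)
            return module

        model = nn.Sequential(
            nn.Conv2d(3, 8, 3, bias=False),   # conv bias on/off is one tested axis
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 4, bias=True),       # linear bias on/off is another
        ).cuda()
        model = FSDP(wrap_bn_separately(model), mixed_precision=True)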
      
      * adding a comment
      Co-authored-by: Min Xu <min.xu@acm.org>
    • [feat] save memory by using bucket buffer only in backward (#633) · a5594032
      Min Xu authored
      
      
      * [feat] save memory by using bucket buffer only in backward
      
      - this fixes bug #627
      - added documentation to clarify the buffer's cost and speed/memory
        tradeoff
      - added setup/teardown calls so that the buffer is only allocated
        during the backward pass, freeing that memory during forward and the
        optimizer step for things like activations
      - added a unit test that asserts the memory usage is in range
      
      Compared with DDP:

        1. the buffer size scales with the number of FSDP instances, not the model size
        2. the buffer is only allocated during backward
        3. the buffer is used only for small tensors, to reduce overhead
        4. the overlap of compute and reduction is very different
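
      A self-contained sketch of the buffer lifecycle described above; the class
      and method names are illustrative, not fairscale's internal API, and it uses
      all_reduce for simplicity where FSDP would reduce-scatter gradient shards:

        import torch
        import torch.distributed as dist

        class ReduceBucket:
            """A flat gradient-reduction buffer that exists only during backward."""

            def __init__(self, numel: int, dtype=torch.float16, device="cuda"):
                self.numel, self.dtype, self.device = numel, dtype, device
                self.buffer = None   # nothing allocated outside of backward
                self.offset = 0

            def setup(self):
                # Called when the backward pass starts: allocate the flat buffer.
                self.buffer = torch.zeros(self.numel, dtype=self.dtype, device=self.device)
                self.offset = 0

            def add(self, grad: torch.Tensor) -> bool:
                # Only small tensors go through the bucket; callers reduce large
                # gradients directly. Returns False if the tensor does not fit at all.
                n = grad.numel()
                if n > self.numel:
                    return False
                if self.offset + n > self.numel:
                    self.flush()
                self.buffer[self.offset : self.offset + n].copy_(grad.view(-1))
                self.offset += n
                return True

            def flush(self):
                if self.offset:
                    dist.all_reduce(self.buffer[: self.offset])
                    self.offset = 0

            def teardown(self):
                # Called when backward ends: flush leftovers and free the buffer so
                # the memory is available to forward, stepping, and activations.
                self.flush()
                self.buffer = None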
      
      * add PR number to changelog
      
      * filled in the memory numbers on 1.9
      
      * addressed comments
      
      * update comments
      
      * fix for 1.6
      
      * add a todo
      Co-authored-by: Min Xu <min.xu@acm.org>
  3. 26 Apr, 2021 1 commit
  4. 23 Apr, 2021 1 commit
    • [FSDP] relax checking root condition (#620) · d3b86d65
      shuyingsunshine21 authored
      * relax checking root condition
      
      * formatting
      
      * add unittest
      
      * add unittest to ci test list
      
      * isort for import of unittest
      
      * format black .
      
      * move test to list 1
      
      * add skip no cuda
      
      * black and isort
  5. 22 Apr, 2021 2 commits
  6. 19 Apr, 2021 1 commit
    • FSDP: fixing training with freezing weights (#614) · 24da3b11
      Min Xu authored
      
      
      * FSDP: fixing training with freezing weights
      
      - an assert is changed to catch this case correctly
      - a unit test (based on Quentin's test code) was added for this case,
        comparing DDP and FSDP
      
      fixes: #610
      
      * added test file to list 1
      
      * Use better and simpler code as suggested by Myle
      
      * testing both methods of freezing as well
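
      A rough sketch of the scenario under test, assuming torch.distributed is
      initialized; the two freezing methods shown (requires_grad=False vs. handing
      the optimizer only the trainable parameters) are the usual ones and may not
      match the unit test's exact variants:

        import torch
        import torch.nn as nn
        from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

        trunk = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        head = nn.Linear(8, 4)

        # Method 1: freeze the trunk by turning gradients off.
        for p in trunk.parameters():
            p.requires_grad = False

        # flatten_parameters=False keeps per-parameter requires_grad intact
        # (an assumption about one reasonable setup, not the test's exact config).
        model = FSDP(nn.Sequential(trunk, head).cuda(), flatten_parameters=False)

        # Method 2: alternatively, leave requires_grad alone and optimize only the head.
        optim = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)

        loss = model(torch.randn(2, 3, 8, 8, device="cuda")).sum()
        loss.backward()
        optim.step()
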
      Co-authored-by: Min Xu <min.xu@acm.org>
  7. 13 Apr, 2021 3 commits
  8. 08 Apr, 2021 1 commit
  9. 07 Apr, 2021 2 commits
  10. 06 Apr, 2021 1 commit
  11. 04 Apr, 2021 2 commits
  12. 02 Apr, 2021 1 commit
  13. 01 Apr, 2021 1 commit
  14. 31 Mar, 2021 2 commits
    • [fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98
      Min Xu authored
      
      * [fix] disable single rank process group for auto_wrap_bn
      
      - beefed up the unit test with a regnet-like model
      - found that the single-rank process group was causing problems
      - disabled it to enable convergence tests on the vissl side
      - used `raise e from None` to get a better assertion output
        in testing.py
      
      * [test] fix regnet test for ddp+mixed_precision
      
      - need an AMP context in FSDP
      - worked around a difference between ddp & fsdp when bias=True
      - fixed a bug in input data generation that caused different ranks to
        have the same data and a wrong iteration count
      - added a TODO about needing a better loss and grad_scaler, and reduced
        iters so there is no nan
      - added (disabled) debugging code
      
      * lint
      
      * lint
      
      * add scaler
      
      * lint
      
      * scaler
      
      * add a real loss
      
      * seeding in the ranks
      
      * balance tests
      
      * run AMP DDP==FSDP test only on cuda version 11 and up
      
      * add relu inplace and comment
      
      * make wrap_bn cover more cases in full precision mode
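
      A condensed sketch of the pieces added across these entries (AMP context,
      grad scaler, a real loss, per-rank seeding), assuming torch.distributed is
      initialized; the model shapes and hyperparameters here are made up:

        import torch
        import torch.distributed as dist
        from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
        from fairscale.optim.grad_scaler import ShardedGradScaler

        torch.manual_seed(1234 + dist.get_rank())   # seed per rank so input data differs

        model = FSDP(
            torch.nn.Sequential(
                torch.nn.Conv2d(3, 4, 3), torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten()
            ).cuda(),
            mixed_precision=True,
        )
        optim = torch.optim.SGD(model.parameters(), lr=0.01)
        scaler = ShardedGradScaler()                # fairscale's FSDP-aware gradient scaler
        criterion = torch.nn.MSELoss()              # a real loss instead of out.sum()

        for _ in range(3):                          # few iterations, to avoid NaN/overflow
            inputs = torch.randn(2, 3, 16, 16, device="cuda")
            target = torch.randn(2, 4, device="cuda")
            with torch.cuda.amp.autocast():         # AMP context around the FSDP forward
                loss = criterion(model(inputs), target)
            scaler.scale(loss).backward()
            scaler.step(optim)
            scaler.update()
            optim.zero_grad()
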
    • acb9ef00
      msbaines authored
  15. 30 Mar, 2021 1 commit
  16. 26 Mar, 2021 1 commit
  17. 25 Mar, 2021 2 commits
  18. 22 Mar, 2021 1 commit
  19. 20 Mar, 2021 1 commit
  20. 19 Mar, 2021 2 commits
  21. 18 Mar, 2021 4 commits
  22. 17 Mar, 2021 1 commit
  23. 12 Mar, 2021 2 commits
  24. 11 Mar, 2021 1 commit
  25. 09 Mar, 2021 2 commits
  26. 08 Mar, 2021 1 commit
    • [fix]: handle inputs with containers in mixed precision (#486) · 2e9a14e7
      Min Xu authored
      * [fix]: handle inputs with containers
      
      - this is an issue surfaced by vissl as well
      - the fix seems to be super simple
      - also cleaned up two tests with respect to multiple such tests
        running back to back (they don't do that presently)
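
      A sketch of the idea behind the fix: with mixed precision, floating-point
      tensors arriving inside containers (dicts, lists, tuples) have to be found
      and cast recursively; the helper name is illustrative, not fairscale's
      internal API:

        from typing import Any
        import torch

        def cast_floats_to_fp16(obj: Any) -> Any:
            # Recursively cast floating-point tensors nested inside containers.
            if torch.is_tensor(obj) and obj.is_floating_point():
                return obj.half()
            if isinstance(obj, dict):
                return {k: cast_floats_to_fp16(v) for k, v in obj.items()}
            if isinstance(obj, (list, tuple)):
                return type(obj)(cast_floats_to_fp16(v) for v in obj)
            return obj

        # Example: forward-call inputs packed in a dict containing a list of tensors.
        batch = {"images": [torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8)], "step": 7}
        batch = cast_floats_to_fp16(batch)
        assert batch["images"][0].dtype == torch.float16
        assert batch["step"] == 7   # non-tensor leaves pass through unchanged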
      
      * cleanup
      
      * fix
      
      * lint