- 06 Apr, 2021 1 commit
Benjamin Lefaudeux authored
-
- 05 Apr, 2021 1 commit
Benjamin Lefaudeux authored
* making APIs more private
* linting
-
- 30 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* survive the model being moved to device post-construction
* make sure that a unit test would catch a regression
-
- 25 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* re-activating unit test
* removing changes that slipped in
-
- 18 Mar, 2021 2 commits
Benjamin Lefaudeux authored
* extracting the buckets in a dedicated class, fixing the resize_ bug
* adding a unit test
* copyright
-
Benjamin Lefaudeux authored
-
- 17 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* Deactivating buckets for a single rank: they do not crash there, but they are not useful either
-
- 09 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* seemingly fix flakiness for gloo by checking all comm handles
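A minimal sketch of the pattern this fix points at: keep every async work handle and wait on all of them before moving on. This assumes an already-initialized process group; `reduce_all_grads` is a hypothetical helper, not the fairscale implementation.

```python
import torch.distributed as dist

def reduce_all_grads(params, group=None):
    # Launch one async all_reduce per gradient and keep every work handle.
    handles = [
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group, async_op=True)
        for p in params
        if p.grad is not None
    ]
    # Waiting on every handle, not just the last one, keeps gloo from racing
    # ahead of in-flight collectives (the flakiness mentioned above).
    for handle in handles:
        handle.wait()
```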
-
- 05 Mar, 2021 1 commit
Benjamin Lefaudeux authored
* [perf][minor] cache the rank lookups, small ShardedDDP perf fix
* tiny improvement, code quality
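A small sketch of what caching the rank lookups can look like, assuming an initialized process group; the class and method names below are hypothetical.

```python
import torch.distributed as dist

class ShardingInfo:
    def __init__(self, group=None):
        self.group = group
        # Cache once at construction: dist.get_rank()/get_world_size() are cheap
        # but add up when queried per parameter in the backward path.
        self.rank = dist.get_rank(group)
        self.world_size = dist.get_world_size(group)

    def param_owner(self, param_index: int) -> int:
        # Round-robin assignment of parameters to ranks, using the cached size.
        return param_index % self.world_size
```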
-
- 04 Mar, 2021 1 commit
Min Xu authored
- cover them in terms of code path only
- numerically, AdaScale is different on SDP/FSDP than on DDP, mainly due to the partial view of the gradients
- this doesn't mean it is definitely not useful, but it is yet to be validated
- not going to spend too much time on it until we have a real use case
-
- 25 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* bring back a fix from FSDP, may help a few existing users
-
- 23 Feb, 2021 2 commits
Benjamin Lefaudeux authored
-
Benjamin Lefaudeux authored
* POC, testing against the DDP comm hook when available
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
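The DDP comm hook referenced here is PyTorch's gradient-compression hook. A hedged sketch of how a plain DDP baseline would enable it for comparison, assuming a PyTorch version that ships `register_comm_hook`:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes the default process group has already been initialized.
model = DDP(torch.nn.Linear(32, 32))

# Compress gradients to fp16 during all_reduce: the compress hook mentioned above.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```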
-
- 19 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* test with and without buckets for all the ShardedDDP unit tests
* parametrize all the things
* refactoring, adding even more combinations at times
* handle hosts not having CUDA
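A hypothetical illustration of the "parametrize all the things" approach; the parameter names are assumptions, and the real tests live in the fairscale test suite.

```python
import pytest
import torch

@pytest.mark.parametrize("reduce_buffer_size", [0, 2 ** 23])  # 0 disables buckets
@pytest.mark.parametrize("backend", ["gloo", "nccl"])
def test_sharded_ddp(reduce_buffer_size, backend):
    # Skip combinations that cannot run on CPU-only hosts.
    if backend == "nccl" and not torch.cuda.is_available():
        pytest.skip("NCCL requires a CUDA-capable host")
    ...  # spawn workers and run the bucketed / non-bucketed variants
```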
-
- 18 Feb, 2021 2 commits
Benjamin Lefaudeux authored
* Adding multiple groups support to ShardedDDP + unit test
* adding gloo to the backends tested for multiple groups
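A minimal sketch of driving ShardedDDP with a dedicated process group; the `group` and `process_group` keyword names are assumptions taken from later fairscale releases, not confirmed by this log.

```python
import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

# Restrict sharding and gradient reduction to a subset of ranks.
subgroup = dist.new_group(ranks=[0, 1])

base = torch.nn.Linear(16, 16)
optimizer = OSS(base.parameters(), optim=torch.optim.SGD, lr=0.1, group=subgroup)  # keyword assumed
model = ShardedDataParallel(base, optimizer, process_group=subgroup)               # keyword assumed
```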
-
Benjamin Lefaudeux authored
* [fix] ShardedDDP train/eval modes
* Update CHANGELOG.md
-
- 17 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* initial implementation, with unit test and assert
* added changelog and better debug string
-
- 12 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you can save some time
* Enabling accumulation tests
-
- 05 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* minor
* minor
-
- 03 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* adding the .to(device) support + unit testing
* doc update
-
- 29 Jan, 2021 1 commit
Benjamin Lefaudeux authored
-
- 21 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* A couple of small improvements, no logic changes
-
- 15 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* minor, but an ease-of-life change, one less papercut
-
- 08 Jan, 2021 2 commits
Benjamin Lefaudeux authored
-
Benjamin Lefaudeux authored
* minor, not life-changing but removing a dependency on runtime optim
-
- 02 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* fix typo, backend for CPU test
-
- 30 Dec, 2020 1 commit
Sean Naren authored
* Add function to add handle for sync BN
* Add test to ensure batch norm handles have been added
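The sync-BN handles presumably pair with PyTorch's SyncBatchNorm. A hedged sketch of the standard user-side preparation (this is plain PyTorch, not the helper added in this commit):

```python
import torch

# Convert regular BatchNorm layers to SyncBatchNorm before wrapping the model,
# so running stats are synchronized across ranks during training.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```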
-
- 19 Dec, 2020 1 commit
Benjamin Lefaudeux authored
[OSS] Getting rid of the "should bucket" hash table, just use a list + non-trainable params fix (#259)
* Getting rid of the "should bucket" hash table, just use a list. Properly handle all params, with or without requires_grad
* make sure that this case is unit tested
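A hypothetical sketch of the list-based bookkeeping described above, covering parameters with and without requires_grad; the helper name and size cutoff are made up for illustration.

```python
import torch

def collect_bucketable_params(module, bucket_cutoff_numel=2 ** 16):
    # Walk *all* parameters once: frozen params never produce grads, so they
    # are excluded from reduction but no longer fall through the cracks.
    to_bucket, direct, frozen = [], [], []
    for p in module.parameters():
        if not p.requires_grad:
            frozen.append(p)
        elif p.numel() < bucket_cutoff_numel:
            to_bucket.append(p)  # small enough to go through the flat buffer
        else:
            direct.append(p)     # reduced on its own
    return to_bucket, direct, frozen
```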
-
- 16 Dec, 2020 2 commits
Benjamin Lefaudeux authored
* Better handling of the callback queue, try to consume it as we go.
* dumping buckets for the reduce part, always the same unused params issue
-
jessijzhao authored
* [feat] add CPU support to tutorials in examples
* now works on a machine without CUDA
* fixes some minor typos
* [cleanup] factorize tutorials in examples
* collects duplicate code across tutorials in helpers.py
* [fix] getData in tutorials now returns an iterable
-
- 15 Dec, 2020 1 commit
Benjamin Lefaudeux authored
-
- 04 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* proper unit testing, but no solution other than disabling bucketing for now; a couple of the options tested do not work
-
- 21 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* rewrite using autograd and the Variable execution queue to make the reduce automatic
* share buckets with OSS to remove duplication
* some speed is likely still on the table, since the speed vs. bucketing does not match expectations; could be a follow-up
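A minimal sketch of the autograd-hook pattern behind "making the reduce automatic": a per-parameter gradient hook fires during backward and launches an async all_reduce. This shows the general idea only, not the fairscale implementation; gradient averaging is omitted for brevity.

```python
import torch
import torch.distributed as dist

def attach_reduce_hooks(model, group=None):
    handles = []

    def make_hook():
        def hook(grad):
            # Called by autograd as soon as this gradient is produced.
            handles.append(dist.all_reduce(grad, group=group, async_op=True))
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
    return handles  # wait on these before the optimizer step
```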
-
- 21 Oct, 2020 1 commit
Min Xu authored
- Aurick noticed this bug and I ran into it yesterday
- after the fix, our cifar training shows the same gain values from different replicas now:
```
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
```
-
- 06 Oct, 2020 1 commit
Benjamin Lefaudeux authored
Same bucketing strategy for OSS and SDP: sort everything ahead of time, per rank and per size, smaller tensors first. Bucket the smallest elements in a fixed buffer, send it async, then send all the others async, and get back to the bucket. Once done, scatter the contents back if needed.
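A hypothetical sketch of the sort-and-pack step described above (per destination rank, smallest tensors first into a fixed flat buffer, the rest sent individually); the helper name and buffer size are made up for illustration.

```python
import torch

def plan_buckets(params_per_rank, buffer_numel=2 ** 19):
    plans = []
    for rank, params in enumerate(params_per_rank):
        params = sorted(params, key=lambda p: p.numel())  # smallest first
        buffer = params[0].new_zeros(buffer_numel) if params else None
        offset, bucketed, direct = 0, [], []
        for p in params:
            if buffer is not None and offset + p.numel() <= buffer_numel:
                # Pack the small tensors into the fixed buffer, sent in one go.
                buffer[offset : offset + p.numel()].copy_(p.detach().reshape(-1))
                bucketed.append((p, offset))
                offset += p.numel()
            else:
                direct.append(p)  # too large for the buffer, sent on its own
        plans.append((rank, buffer, bucketed, direct))
    return plans
```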
-
- 29 Sep, 2020 1 commit
Benjamin Lefaudeux authored
- adding the buffer broadcast option
- minor cleanup in ShardedDDP
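A small sketch of what the buffer broadcast option amounts to, using plain torch.distributed calls; the helper name is hypothetical.

```python
import torch.distributed as dist

def broadcast_buffers(module, src_rank=0, group=None):
    # Keep BatchNorm running stats and other buffers identical across ranks
    # by broadcasting them from one source rank.
    for buf in module.buffers():
        dist.broadcast(buf, src=src_rank, group=group)
```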
-
- 17 Sep, 2020 1 commit
Benjamin Lefaudeux authored
- rename oss_ddp to ShardedDataParallel
- some refactoring
- ShardedDataParallel owns the sharded optimizer, exposed if need be
- some small perf bumps
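A minimal usage sketch of the renamed wrapper owning the sharded optimizer, assuming an already-initialized process group; import paths and keywords reflect later fairscale releases and may differ from this revision.

```python
import torch
import torch.nn.functional as F
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

base = torch.nn.Linear(32, 2)
optimizer = OSS(base.parameters(), optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(base, optimizer)  # the wrapper owns the sharded optimizer

inputs, targets = torch.randn(8, 32), torch.randint(0, 2, (8,))
loss = F.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
```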
-
- 28 Aug, 2020 1 commit
Min Xu authored
- added train(mode) method to be aware of eval mode
-
- 06 Aug, 2020 1 commit
Min Xu authored
Co-authored-by: Min Xu <m1n@fb.com>
-