- 19 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* test with and without buckets for all the shardedDDP unit tests
* parametrize all the things
* refactoring, adding even more combinations at times
* handle hosts not having cuda
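For context, the parametrization pattern described here looks roughly like the sketch below; the parameter names and values are illustrative assumptions, not the actual fairscale test suite.

```python
# Illustrative sketch of parametrizing ShardedDDP tests over bucketing and
# backend; names and values are assumptions, not the real fairscale tests.
import pytest
import torch

@pytest.mark.parametrize("reduce_buffer_size", [0, 2 ** 23])  # 0 disables bucketing
@pytest.mark.parametrize("backend", ["gloo", "nccl"])
def test_sharded_ddp(reduce_buffer_size, backend):
    if backend == "nccl" and not torch.cuda.is_available():
        pytest.skip("NCCL needs a CUDA-capable host")  # "handle hosts not having cuda"
    ...  # spawn workers, wrap the model in ShardedDDP, run steps, check parity
```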
- 18 Feb, 2021 2 commits
Benjamin Lefaudeux authored
* Adding multiple groups support to ShardedDDP + unit test
* adding gloo to the backends tested for multiple groups
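A rough usage sketch for the multiple-groups support, assuming torch.distributed is already initialized; the exact argument names (`group` on OSS, `process_group` on ShardedDataParallel) should be checked against the fairscale release in use.

```python
# Sketch: running ShardedDDP on a subgroup of ranks (argument names assumed).
import torch
import torch.distributed as dist
from fairscale.optim import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

model = torch.nn.Linear(8, 8)
subgroup = dist.new_group(ranks=[0, 1])  # every rank must call new_group
if dist.get_rank() in (0, 1):
    optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1, group=subgroup)
    model = ShardedDataParallel(model, optimizer, process_group=subgroup)
```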
Benjamin Lefaudeux authored
* [fix] ShardedDDP train/eval modes
* Update CHANGELOG.md
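The fix makes mode switching behave like plain PyTorch; a minimal sketch, assuming `ddp_model` was built as `ShardedDataParallel(model, optimizer)` and `batch` is an input tensor:

```python
# Minimal sketch of train/eval switching on a ShardedDDP-wrapped model.
import torch

ddp_model.eval()                    # eval mode: no gradient reduction expected
with torch.no_grad():
    predictions = ddp_model(batch)
ddp_model.train()                   # back to training: grads are reduced again
```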
- 17 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* initial implementation, with unit test and assert
* added changelog and better debug string
- 12 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed; optional, so that you can save time when nothing changed
* Enabling accumulation tests
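A usage sketch of the optional refresh described above; `refresh_trainable` matches the current fairscale method name, but verify it against the version you run.

```python
# After changing which parameters are trainable, rebuild ShardedDDP's
# bookkeeping. (Method name per current fairscale; verify for your version.)
for p in model.parameters():
    p.requires_grad = False         # e.g. freeze part of the model
ddp_model.refresh_trainable()       # explicit and optional, so you only pay when needed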
- 10 Feb, 2021 1 commit
Myle Ott authored
* Add fairscale.utils.containers
* Add fairscale.nn.misc.checkpoint_activations
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
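A short sketch of how the new wrapper is used; `checkpoint_wrapper` is the entry point in `fairscale.nn.misc`, though later releases may move or extend it.

```python
# Activation checkpointing: trade compute for memory by recomputing
# intermediate activations during the backward pass.
import torch.nn as nn
from fairscale.nn.misc import checkpoint_wrapper

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
block = checkpoint_wrapper(block)   # forward results are recomputed in backward
```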
- 09 Feb, 2021 1 commit
msbaines authored
- 04 Feb, 2021 4 commits
msbaines authored
Benjamin Lefaudeux authored
* Adding a proper DDP parity / AMP unit test, overdue
* catch PyTorch builds without AMP support
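The "catch PyTorch builds without AMP" guard presumably looks like the sketch below; the exact check in the test suite may differ.

```python
# Skip AMP parity tests on torch builds that lack torch.cuda.amp.autocast.
import pytest
import torch

AMP_AVAILABLE = hasattr(torch.cuda, "amp") and hasattr(torch.cuda.amp, "autocast")

@pytest.mark.skipif(not AMP_AVAILABLE, reason="this PyTorch build has no AMP support")
def test_ddp_parity_amp():
    ...  # run DDP and ShardedDDP under autocast, compare losses and params
```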
msbaines authored
msbaines authored
- 03 Feb, 2021 2 commits
msbaines authored
Benjamin Lefaudeux authored
* adding the .to(device) support + unit testing
* doc update
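Usage sketch for the `.to(device)` support, assuming the semantics described by this commit and a pre-built `ddp_model`:

```python
# Moving a ShardedDDP-wrapped model between devices; per this commit the
# wrapped module and its buffers follow the move.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ddp_model.to(device)
```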
- 02 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* root cause unknown, but the failure is narrowly scoped (gloo + CPU + Python 3.8 + no CUDA installed), so it is arguably out of scope for fairscale
- 30 Jan, 2021 1 commit
msbaines authored
- 29 Jan, 2021 1 commit
msbaines authored
- 27 Jan, 2021 1 commit
msbaines authored
- 23 Jan, 2021 1 commit
Siddharth Goyal authored
* Add AMPnet implementation (clean version)
* Move AMPnet to experimental
* Move pipeline code around
* Address review comments and fix pre-commit errors
* Refactor and modify delegate functionality
* Modify header in pipe.py
- 21 Jan, 2021 3 commits
Benjamin Lefaudeux authored
* working around broken mypy
Myle Ott authored
Myle Ott authored
- 15 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* minor, but ease of life, one less papercut
- 11 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* tentatively fixing the CPU version of the CircleCI jobs; the pipe tests are now the last ones failing
* fixing OSS backward compatibility, and trying to fix RPC on older PyTorch as well
* fixing the file-based init on torch 1.5
- 05 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code; slightly more reasonable timeout; works on a single CUDA device
* finding the root bug for some of the CPU hangs: RPC init
* propagating all the RPC init test changes to the pipe and model parallel tests
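pytest-timeout is the plugin in question; typical usage matching this change looks like the sketch below (timeout values illustrative).

```python
# Per-test timeout via the pytest-timeout plugin, so a hanging collective
# fails fast instead of stalling CI. (Values are illustrative.)
import pytest

@pytest.mark.timeout(30)
def test_rpc_init():
    ...  # init rpc, run the pipe / model-parallel path, shut down

# or globally on the command line:  pytest --timeout=60
```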
- 02 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* fix typo, backend for CPU test
- 30 Dec, 2020 1 commit
Sean Naren authored
* Add function to add handle for sync BN
* Add test to ensure batch norm handles have been added
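For context, sync-BN models are typically produced with the standard PyTorch converter, and the new fairscale handles are attached inside the wrapper; a sketch, not the exact fairscale code:

```python
# Convert BatchNorm layers to SyncBatchNorm, then wrap; the sync-BN handles
# from this commit are attached by ShardedDDP internally (internal API).
import torch
from fairscale.nn.data_parallel import ShardedDataParallel

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
ddp_model = ShardedDataParallel(model, optimizer)
```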
- 29 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* properly catch a given test failing when there are not enough GPUs
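The usual shape of such a guard (the GPU-count threshold is illustrative):

```python
# Skip rather than fail when the host has fewer GPUs than the test needs.
import pytest
import torch

@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
def test_two_gpu_path():
    ...
```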
- 28 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* file-based distributed init
* nicer handling of world sizes that exceed the number of available GPUs: warn instead of breaking
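File-based rendezvous uses the standard torch.distributed API; a minimal sketch, with the temp-file path, backend, and rank/world_size wiring left illustrative:

```python
# File-based process-group init: all ranks rendezvous through a shared file,
# which avoids TCP port issues in CI. (Path and backend are illustrative.)
import tempfile
import torch.distributed as dist

init_file = tempfile.NamedTemporaryFile(delete=False)
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file.name}",
    rank=rank,                 # set per process
    world_size=world_size,
)
```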
- 19 Dec, 2020 1 commit
Benjamin Lefaudeux authored
[OSS] Getting rid of the "should bucket" hash table, just use a list + non trainable params fix (#259)
* Getting rid of the "should bucket" hash table, just use a list; properly handle all params, with or without requires_grad
* make sure that this case is unit tested
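The spirit of the change, as a sketch (illustrative, not the actual OSS code; `SMALL_PARAM_NUMEL` is a made-up threshold):

```python
# One flat list aligned with the parameter order replaces the per-tensor hash
# table; params without requires_grad stay tracked but are never bucketed.
SMALL_PARAM_NUMEL = 2 ** 17          # illustrative, not a fairscale constant
params = list(model.parameters())
should_bucket = [p.requires_grad and p.numel() < SMALL_PARAM_NUMEL for p in params]
```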
- 10 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* unit test checking DDP and sharded DDP equivalence, reproducing the issue that Sean spotted
* fixing the issue: requests in flight were not being counted properly
* adding a multiple-optimizers case
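The parity check at the heart of such a test is roughly the following (assumed shape, not the exact fairscale test):

```python
# After identical steps on both wrappers, parameters should match.
import torch

for p_ddp, p_sharded in zip(ddp_model.parameters(), sharded_model.parameters()):
    assert torch.allclose(p_ddp, p_sharded, atol=1e-5), "DDP / ShardedDDP diverged"
```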
- 04 Dec, 2020 1 commit
Benjamin Lefaudeux authored
* proper unit testing, but no solution other than disabling bucketing for now; the couple of options tested do not work
- 01 Dec, 2020 2 commits
Benjamin Lefaudeux authored
Benjamin Lefaudeux authored
* fallback on internal pytorch numbering
- 21 Nov, 2020 1 commit
Benjamin Lefaudeux authored
* rewrite using autograd and the Variable execution queue to make the reduce automatic
* share buckets with OSS to remove duplication
* some speed is still likely on the table, since the speed vs. bucketing trade-off does not match expectations; could be a follow-up
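A minimal sketch of the hook-driven automatic reduce (illustrative only; the real implementation goes through the autograd execution queue and buckets grads into buffers shared with OSS):

```python
# Fire an async all_reduce as each gradient becomes available, then wait for
# all of them before the optimizer step. (Sketch; fairscale buckets these.)
import torch.distributed as dist

handles = []

def make_hook(param):
    def reduce_hook(grad):
        handles.append(dist.all_reduce(grad, async_op=True))
        return grad
    return reduce_hook

for p in model.parameters():
    if p.requires_grad:
        p.register_hook(make_hook(p))

# after loss.backward():
#   for h in handles: h.wait()
#   optimizer.step()
```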
- 18 Nov, 2020 1 commit
Tom Birch authored
- 11 Nov, 2020 2 commits
- 10 Nov, 2020 1 commit
Tom Birch authored
Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
Also added examples for multi-process and PipeRPCWrapper usage.
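For orientation, basic (non-RPC) fairscale Pipe usage looks like the sketch below; the balance split and chunk count are illustrative, and two devices are assumed.

```python
# Split a Sequential across devices and pipeline microbatches through it.
import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
model = Pipe(model, balance=[2, 1], chunks=4)  # 2 layers on device 0, 1 on device 1
```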
- 30 Oct, 2020 1 commit
msbaines authored
- 29 Oct, 2020 1 commit
msbaines authored