Commits · 13445c555aa459bf10b5be9f496937acaca72006 · OpenDAS / fairscale

12 Feb, 2021 1 commit

[feature-fix-refactor][ShardedDDP] Make it possible to change trainability graph on the fly (#369) · 13445c55

Benjamin Lefaudeux authored Feb 11, 2021

* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests

13445c55

05 Feb, 2021 1 commit
- [fix] repro+fix (#365) · 8778fa66
  Benjamin Lefaudeux authored Feb 05, 2021
```
Fix a broken earlier commit that only worked for the first step
```
  8778fa66
03 Feb, 2021 1 commit
- [chore] disheartening switch off of an OSS cpu test (#356) · 011c0c41
  Benjamin Lefaudeux authored Feb 03, 2021
```
* precise skip, only if agent has only cpu
```
  011c0c41
02 Feb, 2021 1 commit

[feat][OSS] elastic and pytorch compatible checkpoints (#310) · 9e8929e6

Benjamin Lefaudeux authored Feb 02, 2021

* adding a test to prove the interoperability with upstream pytorch
* updating the changelog
* eager state pruning
* pytorch 1.5 compat

9e8929e6

27 Jan, 2021 1 commit
- [fix] OSS Cpu tests (#333) · e6aef938
  Benjamin Lefaudeux authored Jan 27, 2021
  
  e6aef938
20 Jan, 2021 1 commit
- [fix] OSS tensor view corner case + corresponding unit tests (#315) · ce2f64f9
  Benjamin Lefaudeux authored Jan 19, 2021
  
  ce2f64f9
11 Jan, 2021 1 commit

[chore][ci] restore 1.5 & 1.6 tests and compatibility (#306) · 2d954203

Benjamin Lefaudeux authored Jan 11, 2021

* tentatively fixing the cpu version of circleci jobs, now pipe tests are the last ones standing
* fixing oss backcompat, trying to fix rpc in old pytorch also
* fixing the file based init in torch 1.5

2d954203

08 Jan, 2021 3 commits
- [refactor][OSS] Adding a pytorch parity unit test (#298) · 3d02f052
  Benjamin Lefaudeux authored Jan 08, 2021
```
* adding a parity unit test
* code review, better testing, use torch defaults and check for the loss, log world size
```
  3d02f052
- [refactor][OSS] Removing ad-hoc object broadcast, use pytorch's (#297) · 3399e97c
  Benjamin Lefaudeux authored Jan 08, 2021
  
  3399e97c
- [feat] Support model parallelism in OSS (#287) · 9faad392
  Joshua Meier authored Jan 08, 2021
```
* add additional unit test
* support model parallelism in oss
```
  9faad392
05 Jan, 2021 1 commit

[fix] Flaky tests (#283) · 79365ee6

Benjamin Lefaudeux authored Jan 04, 2021

* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests

79365ee6

29 Dec, 2020 1 commit
- [feature] OSS: add unit test for distributed checkpointing (#273) · 60c8de4a
  Joshua Meier authored Dec 28, 2020
```
author: Joshua Meier
```
  60c8de4a
22 Dec, 2020 1 commit

[OSS] Balance the trainable params only (#262) · c386e937

Benjamin Lefaudeux authored Dec 21, 2020

* fix, one liner

* adjust so that frozen trunks still get spread, even if this should have little consequence

* removing dead code, hopeful unit test fix

* now with some linting..

* adding a proper unit test case

c386e937
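
The balancing idea in this commit can be illustrated with a small pure-Python sketch. It is not the fairscale implementation: the `Param` tuple, the `partition_params` helper, and the choice of a token weight of 1 for frozen parameters are assumptions made for illustration. Only trainable parameters add their size to a rank's load, while frozen ones still get spread across ranks.

```python
from typing import List, Tuple

# Hypothetical parameter description: (name, number of elements, trainable?)
Param = Tuple[str, int, bool]


def partition_params(params: List[Param], world_size: int) -> List[List[str]]:
    """Greedily assign each parameter to the currently lightest rank.

    Only trainable parameters add their element count to a rank's load,
    since only they carry optimizer state. Frozen parameters add a token
    weight of 1 so that they are still spread (an assumption of this sketch).
    """
    shards: List[List[str]] = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for name, numel, trainable in params:
        rank = loads.index(min(loads))  # first rank with the smallest load
        shards[rank].append(name)
        loads[rank] += numel if trainable else 1
    return shards


params = [
    ("trunk.w", 1000, False),
    ("trunk.b", 1000, False),
    ("head.w", 10, True),
    ("head.b", 10, True),
]
shards = partition_params(params, world_size=2)
```

With two ranks, the frozen trunk tensors and the trainable head tensors each end up split across both ranks rather than piling up on one.
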

06 Dec, 2020 1 commit
- [fix] skipping NCCL tests on 2-GPU systems (#233) · bb468670
  Min Xu authored Dec 05, 2020
  
  bb468670
16 Nov, 2020 1 commit
- [feat] OSS-aware clip grads, bridge sharded states (#167) · ade312c4
  Benjamin Lefaudeux authored Nov 16, 2020
```
add a clip gradients util, equivalent to torch's but aware of the sharded states. Add a corresponding unit test
```
  ade312c4
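
To make the "aware of the sharded states" part concrete, here is a minimal pure-Python sketch, not the fairscale code. It assumes an L2 norm and simulates ranks with plain lists; `clip_sharded_grads` is a hypothetical name. Each rank only sees its own gradients, so the global norm is rebuilt from per-rank squared norms (an all-reduce in a real distributed run) before applying the same clip coefficient as `torch.nn.utils.clip_grad_norm_`.

```python
import math
from typing import List


def clip_sharded_grads(shards: List[List[float]], max_norm: float) -> float:
    """Clip gradients in place across all shards and return the global norm."""
    # Each rank computes the squared L2 norm of its own shard. In a real
    # distributed run these partial sums would be combined with an all-reduce.
    local_sq_norms = [sum(g * g for g in shard) for shard in shards]
    total_norm = math.sqrt(sum(local_sq_norms))

    # Same coefficient as torch.nn.utils.clip_grad_norm_
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for shard in shards:
            for i in range(len(shard)):
                shard[i] *= clip_coef
    return total_norm


grads_per_rank = [[3.0], [4.0]]  # two simulated ranks
norm_before = clip_sharded_grads(grads_per_rank, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for shard in grads_per_rank for g in shard))
```

Clipping each shard against its local norm alone would give a different result, which is why the norm has to be computed globally first.
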
06 Nov, 2020 1 commit
- [fix] OSS tests - remove concurrent dist inits (#177) · 543d5693
  Benjamin Lefaudeux authored Nov 06, 2020
  
  543d5693
14 Oct, 2020 2 commits
- [bugfix] OSS + Apex (#136) · 37c686e7
  Benjamin Lefaudeux authored Oct 14, 2020
```
* fixing the issue wrt Apex, validated with Latte, Classy would need another pass
```
  37c686e7
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
  6d802f5a
08 Oct, 2020 1 commit
- [fix] OSS unit test to check data group (#129) · 81ac5b28
  Benjamin Lefaudeux authored Oct 08, 2020
```
* new unit test to catch rank issues in OSS
```
  81ac5b28
15 Sep, 2020 2 commits
- [feat] Gracefully handle local/global state dict queries (#89) · d16e9f61
  Benjamin Lefaudeux authored Sep 15, 2020
```
Return either the local or global state when queried, depending on a prior consolidation
```
  d16e9f61
- [feat] OSS: optional closure argument for the optimizer (#86) · 3d7f524a
  Benjamin Lefaudeux authored Sep 15, 2020
```
Make OSS compatible with optimizers which do not support the closure argument
```
  3d7f524a
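
One way a wrapper can stay compatible with optimizers that do not accept a closure is sketched below. The classes are hypothetical stand-ins, not fairscale or PyTorch types: the wrapper only passes `closure` through when the caller actually supplied one.

```python
from typing import Callable, Optional


class NoClosureOptimizer:
    """Stand-in for an optimizer whose step() takes no closure argument."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


class ClosureOptimizer:
    """Stand-in for an optimizer that evaluates a closure and returns the loss."""

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        return closure() if closure is not None else None


class Wrapper:
    """Hypothetical wrapper that forwards the closure only when one is given."""

    def __init__(self, optim) -> None:
        self.optim = optim

    def step(self, closure: Optional[Callable[[], float]] = None):
        if closure is not None:
            return self.optim.step(closure=closure)
        return self.optim.step()


plain = Wrapper(NoClosureOptimizer())
plain.step()  # works even though the wrapped step() has no closure parameter
loss = Wrapper(ClosureOptimizer()).step(closure=lambda: 0.25)
```
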
09 Sep, 2020 1 commit

[feat] OSS flatten state dict (#65) · 4f597233

Benjamin Lefaudeux authored Sep 09, 2020

Changes the structure of the returned state dict with respect to the param_groups so that it is closer to what a vanilla optimizer would return (the groups are un-sharded). They are sharded again when loading.

4f597233
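
As a rough illustration of un-sharding on save and re-sharding on load, the sketch below flattens per-rank lists of param groups into one list and records where each rank's slice starts and ends. The `flatten_groups` and `reshard_groups` helpers and the `(start, end)` bookkeeping are assumptions of this sketch, not the layout used by the commit.

```python
from typing import Any, Dict, List, Tuple

Group = Dict[str, Any]


def flatten_groups(per_rank: List[List[Group]]) -> Tuple[List[Group], List[Tuple[int, int]]]:
    """Concatenate per-rank param groups and remember each rank's slice."""
    flat: List[Group] = []
    partition: List[Tuple[int, int]] = []
    for groups in per_rank:
        start = len(flat)
        flat.extend(groups)
        partition.append((start, len(flat)))
    return flat, partition


def reshard_groups(flat: List[Group], partition: List[Tuple[int, int]]) -> List[List[Group]]:
    """Split a flat list of param groups back into per-rank lists."""
    return [flat[start:end] for start, end in partition]


per_rank = [
    [{"params": [0, 1], "lr": 0.1}],
    [{"params": [2], "lr": 0.1}, {"params": [3], "lr": 0.01}],
]
flat, partition = flatten_groups(per_rank)
restored = reshard_groups(flat, partition)
```
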

08 Sep, 2020 1 commit

[feat] OSS: Sync all attributes (#67) · 5a268b25

Benjamin Lefaudeux authored Sep 08, 2020

Make sure that all attributes (not just LR) are in sync between the OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a scheduled basis, which proves useful depending on the optimizer, so the keys need to be generically supported (not just "lr"). Leaving these attributes unsynced is a worst-case scenario, because the adjustments are silently not propagated; this commit fixes that.

5a268b25
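
A minimal sketch of generic attribute syncing, using plain dictionaries in place of real optimizers. `sync_param_groups` is a hypothetical helper, and the only assumption taken from PyTorch is that each param group stores its tensors under the `"params"` key, which must not be overwritten.

```python
from typing import Any, Dict, List


def sync_param_groups(source: List[Dict[str, Any]], target: List[Dict[str, Any]]) -> None:
    """Copy every hyperparameter except "params" from source groups to target groups."""
    for src, dst in zip(source, target):
        for key, value in src.items():
            if key != "params":  # the wrapped optimizer keeps its own shard of params
                dst[key] = value


# Exposed groups (what a scheduler would modify) and the wrapped optimizer's groups
outer = [{"params": [0, 1], "lr": 0.1, "momentum": 0.9}]
inner = [{"params": [0], "lr": 0.1, "momentum": 0.9}]

# A scheduler changes more than just the learning rate
outer[0]["lr"] = 0.01
outer[0]["momentum"] = 0.5

sync_param_groups(outer, inner)
```

Iterating over all keys, rather than hard-coding `"lr"`, is what keeps scheduled changes to attributes such as momentum from being silently lost.
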

03 Sep, 2020 1 commit

[fix] OSS pytorch-compliant state dict (#61) · 1d1d15ea

Benjamin Lefaudeux authored Sep 03, 2020

* Aligning the optimizer state dict with what PyTorch expects

* Adding a check on the dict keys, ensure that `state` and `param_groups` are there

* after installing the specific isort, black and all, one liner to please the linter..

1d1d15ea
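
The key check mentioned above could look roughly like the following. `check_state_dict` is a hypothetical helper written for illustration; it relies only on the fact that a PyTorch optimizer state dict has the top-level keys `state` and `param_groups`.

```python
from typing import Any, Dict

REQUIRED_KEYS = ("state", "param_groups")


def check_state_dict(state_dict: Dict[str, Any]) -> None:
    """Raise KeyError if a required top-level key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in state_dict]
    if missing:
        raise KeyError(f"optimizer state dict is missing keys: {missing}")


check_state_dict({"state": {}, "param_groups": []})  # passes silently

try:
    check_state_dict({"state": {}})
    error_message = ""
except KeyError as exc:
    error_message = str(exc)
```
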

28 Aug, 2020 1 commit

[fix] optim/oss: work correctly with LRScheduler (#58) · ab32cb7d

msbaines authored Aug 28, 2020

* [fix] optim/oss: work correctly with LRScheduler

Sync lr before every step and before consolidate.

ab32cb7d

27 Aug, 2020 3 commits
- [refactor] optim/oss: save memory and time by avoiding duplicate copy of parameters (#57) · e4a0804c
  msbaines authored Aug 27, 2020
  
  e4a0804c
- [fix] optim/oss: PyTorch already handles putting state on proper device (#54) · 220ee323
  msbaines authored Aug 27, 2020
  
  220ee323
- [fix] optim/oss: support optimizers with additional step kwargs (#53) · 09028a0d
  msbaines authored Aug 26, 2020
```
* [fix] optim/oss: support optimizers with additional step kwargs

Some of the optimizers in apex support additional kwargs to step
such as scale.
```
  09028a0d
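
Supporting extra `step` arguments usually comes down to forwarding keyword arguments untouched. The sketch below uses hypothetical stand-in classes; the `scale` keyword mirrors the example given in the commit message, but the way it is applied here is only an assumption for illustration.

```python
from typing import Any, List


class ScaledOptimizer:
    """Hypothetical optimizer whose step() accepts an extra `scale` kwarg."""

    def __init__(self, params: List[float], grads: List[float], lr: float) -> None:
        self.params = params
        self.grads = grads
        self.lr = lr

    def step(self, scale: float = 1.0) -> None:
        # Un-scale the gradients before applying a plain SGD update
        for i, grad in enumerate(self.grads):
            self.params[i] -= self.lr * grad / scale


class ForwardingWrapper:
    """Hypothetical wrapper that passes any step kwargs through unchanged."""

    def __init__(self, optim: Any) -> None:
        self.optim = optim

    def step(self, **kwargs: Any) -> Any:
        return self.optim.step(**kwargs)


optim = ScaledOptimizer(params=[1.0], grads=[2.0], lr=0.5)
ForwardingWrapper(optim).step(scale=2.0)
```
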
20 Aug, 2020 1 commit
- [fix] OSS restore state to proper device (#46) · c2d6f4b6
  Benjamin Lefaudeux authored Aug 20, 2020
```
* move the restored param groups to the original device

* adding a corresponding test
```
  c2d6f4b6
14 Aug, 2020 1 commit

[fix] Properly restore a sharded optim state (#39) · 585f177b

Benjamin Lefaudeux authored Aug 14, 2020



* hotfix a half-cooked optimizer state restoration: the global shared state also needs to be restored

* [cleanup] get 100% coverage on oss.py (#38)
authored-by: Mandeep Singh Baines <msb@fb.com>

* better unit testing, check that the .param_groups attribute is properly in sync with the loaded state
Co-authored-by: msbaines <35972327+msbaines@users.noreply.github.com>

585f177b

13 Aug, 2020 1 commit

Aligning OSS state dict with... · 57079b08

Benjamin Lefaudeux authored Aug 12, 2020

Aligning OSS state dict with `https://pytorch.org/docs/stable/_modules/torch/optim/optimizer.html#Optimizer` (#31)

57079b08

08 Aug, 2020 1 commit
- [fix] fix test_oss.py when the host has 2 GPUs (#26) · d9e6ceaa
  Min Xu authored Aug 07, 2020
```
Co-authored-by: Min Xu <m1n@fb.com>
```
  d9e6ceaa
31 Jul, 2020 1 commit
- [feat] Implement OSS save and load of the sharded state from a single replica (#16) · 8e363567
  Benjamin Lefaudeux authored Jul 31, 2020
  
  8e363567
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  
  0cd65242