Commits · bc1e60e09c2437c284284e09c11ec9e6767a998b · OpenDAS / fairscale

26 Jun, 2021 1 commit
- Fix pytorch version check (#716) · bc1e60e0
  Pavel Belevich authored Jun 25, 2021
  
  bc1e60e0
08 May, 2021 1 commit
- [chore] Rename and move utils.py from optim/ to utils/ (#669) · 5739930f
  anj-s authored May 07, 2021
```
* rename and move optim/utils.py

* attach the new file
```
  5739930f
06 Apr, 2021 1 commit
- [fix][OSS] two small hotfixes.. repro not obvious for grad_fn (#583) · 121b9db0
  Benjamin Lefaudeux authored Apr 06, 2021
  
  121b9db0
05 Apr, 2021 1 commit
- [OSS/ShardedDDP] making APIs more private (#582) · e41452e8
  Benjamin Lefaudeux authored Apr 05, 2021
```
* making APIs more private
* linting
```
  e41452e8
04 Apr, 2021 1 commit
- [fix] OSS - enforce cuda parameters for state consolidation if NCCL backend (#573) · 88553373
  Benjamin Lefaudeux authored Apr 03, 2021
  
  88553373
19 Mar, 2021 1 commit
- [feat][refactor][OSS] Param buckets + fp16 broadcasts (#540) · e3865549
  Benjamin Lefaudeux authored Mar 19, 2021
```
* param buckets
* unifying the buckets
```
  e3865549
18 Mar, 2021 1 commit
- [fix][OSS] enabling disabled tests for 1.8 (#534) · 7b127ccb
  Benjamin Lefaudeux authored Mar 18, 2021
```
* enabling disabled tests
```
  7b127ccb
17 Mar, 2021 1 commit
- [feat][OSS] handle the device being changed after construction (#523) · d217278c
  Benjamin Lefaudeux authored Mar 16, 2021
  
  d217278c
15 Mar, 2021 1 commit

[feat] Make OSS state available on all ranks (#500) · 2d2412e2

Benjamin Lefaudeux authored Mar 15, 2021

* extending the current state_dict interface, make it possible to do everything in a single call, and to checkpoint on all ranks

2d2412e2

12 Mar, 2021 1 commit
- [chore] update to torch v1.8.0 (#508) · c79bbd01
  msbaines authored Mar 11, 2021
  
  c79bbd01
11 Mar, 2021 1 commit

[fix][OSS] Adding a hard sync stream barrier before broadcast (#512) · c9fdf506

Benjamin Lefaudeux authored Mar 11, 2021

* Adding a hard sync barrier before the broadcast, mostly useful for Gloo actually, NCCL is synced behind the scene
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510

c9fdf506

09 Mar, 2021 1 commit
- [fix] oss and interleaved param groups (#483) · 02405740
  Benjamin Lefaudeux authored Mar 08, 2021
  
  02405740
05 Mar, 2021 1 commit
- [fix][minor] Change empty shard handling for OSS, do not rely on asserts (#460) · d1fab39e
  Benjamin Lefaudeux authored Mar 04, 2021
```
* change empty shard handling for OSS, do not rely on asserts
* code review
```
  d1fab39e
23 Feb, 2021 1 commit

Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e

Myle Ott authored Feb 22, 2021

Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336

) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.

Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
    * all-gather parameters at start of forward pass and start of backward pass
    * reduce-scatter grads at end of backward pass
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

15512d9e

22 Feb, 2021 1 commit
- [fix][OSS] adding an assert for empty shards + corresponding unit test (#406) · 279b8024
  Benjamin Lefaudeux authored Feb 22, 2021
```
* adding an assert + corresponding unit test
* updated changelog
* adjusting the adascale tests
```
  279b8024
14 Feb, 2021 1 commit

[fix] OSS dict load/save fix - better fix than 383 and unit test (#386) · 54bd62d3

Benjamin Lefaudeux authored Feb 13, 2021

* WIP, needs to be fixed !

* should be a fix, many thanks Weiyi Zheng

* slightly better unit test, sorting the states on the way out

* reproducing the issue from Weiyi in a unit test, and finally properly fixing

* fixing unit test on pytorch1.5 - original loss diff 26.404895782470703 - 26.404342651367188

54bd62d3

12 Feb, 2021 1 commit

[feature-fix-refactor][ShardedDDP] Make it possible to change trainability graph on the fly (#369) · 13445c55

Benjamin Lefaudeux authored Feb 11, 2021

* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests

13445c55

05 Feb, 2021 1 commit
- [fix] repro+fix (#365) · 8778fa66
  Benjamin Lefaudeux authored Feb 05, 2021
```
fix a broken earlier commit, only worked for the first step
```
  8778fa66
03 Feb, 2021 1 commit
- [chore] disheartening switch off of a OSS cpu test (#356) · 011c0c41
  Benjamin Lefaudeux authored Feb 03, 2021
```
* precise skip, only if agent has only cpu
```
  011c0c41
02 Feb, 2021 1 commit

[feat][OSS] elastic and pytorch compatible checkpoints (#310) · 9e8929e6

Benjamin Lefaudeux authored Feb 02, 2021

* adding a test to prove the inter operability with upstream pytorch
* updating the changelog
* eager state pruning
* pytorch 1.5 compat

9e8929e6

27 Jan, 2021 1 commit
- [fix] OSS Cpu tests (#333) · e6aef938
  Benjamin Lefaudeux authored Jan 27, 2021
  
  e6aef938
20 Jan, 2021 1 commit
- [fix] OSS tensor view corner case + corresponding unit tests (#315) · ce2f64f9
  Benjamin Lefaudeux authored Jan 19, 2021
  
  ce2f64f9
11 Jan, 2021 1 commit

[chore][ci] restore 1.5 & 1.6 tests and compatibility (#306) · 2d954203

Benjamin Lefaudeux authored Jan 11, 2021

* tentatively fixing the cpu version of circleci jobs, now pipe tests are the last ones standing
* fixing oss backcompat, trying to fix rpc in old pytorch also
* fixing the file based init in torch 1.5

2d954203

08 Jan, 2021 3 commits
- [refactor][OSS] Adding a pytorch parity unit test (#298) · 3d02f052
  Benjamin Lefaudeux authored Jan 08, 2021
```
* adding a parity unit test
* code review, better testing, use torch defaults and check for the loss, log world size
```
  3d02f052
- [refactor][OSS] Removing ad-hoc object broadcast, use pytorch's (#297) · 3399e97c
  Benjamin Lefaudeux authored Jan 08, 2021
  
  3399e97c
- [feat] Support model parallelism in OSS (#287) · 9faad392
  Joshua Meier authored Jan 08, 2021
```
* add additional unit test
* support model parallelism in oss
```
  9faad392
05 Jan, 2021 1 commit

[fix] Flaky tests (#283) · 79365ee6

Benjamin Lefaudeux authored Jan 04, 2021

* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests

79365ee6

29 Dec, 2020 1 commit
- [feature] OSS: add unit test for distributed checkpointing (#273) · 60c8de4a
  Joshua Meier authored Dec 28, 2020
```
author: Joshua Meier
```
  60c8de4a
22 Dec, 2020 1 commit

[OSS] Balance the trainable params only (#262) · c386e937

Benjamin Lefaudeux authored Dec 21, 2020

* fix, one liner

* adjust so that frozen trunks get spread still, even if this should have little consequences

* removing dead code, hopeful unit test fix

* now with some linting..

* adding a proper unit test case

c386e937

06 Dec, 2020 1 commit
- [fix] skipping NCCL tests on 2-GPU systems (#233) · bb468670
  Min Xu authored Dec 05, 2020
  
  bb468670
16 Nov, 2020 1 commit
- [feat] OSS-aware clip grads, bridge sharded states (#167) · ade312c4
  Benjamin Lefaudeux authored Nov 16, 2020
```
add a clip gradients util, equivalent to torch's but aware of the sharded states. Add a corresponding unit test
```
  ade312c4
06 Nov, 2020 1 commit
- [fix] OSS tests - remove concurrent dist inits (#177) · 543d5693
  Benjamin Lefaudeux authored Nov 06, 2020
  
  543d5693
14 Oct, 2020 2 commits
- [bugfix] OSS + Apex (#136) · 37c686e7
  Benjamin Lefaudeux authored Oct 14, 2020
```
* fixing the issue wrt Apex, validated with Latte, Classy would need another pass
```
  37c686e7
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
  6d802f5a
08 Oct, 2020 1 commit
- [fix] OSS unit test to check data group (#129) · 81ac5b28
  Benjamin Lefaudeux authored Oct 08, 2020
```
* new unit test to catch rank issues in OSS
```
  81ac5b28
15 Sep, 2020 2 commits
- [feat] Gracefully handle local/global state dict queries (#89) · d16e9f61
  Benjamin Lefaudeux authored Sep 15, 2020
```
Return either the local or global state when queried, depending on a prior consolidation
```
  d16e9f61
- [feat ] OSS : optional closure argument for the optimizer (#86) · 3d7f524a
  Benjamin Lefaudeux authored Sep 15, 2020
```
Make OSS compatible with optimizers which do not support the closure argument
```
  3d7f524a
09 Sep, 2020 1 commit

[feat] OSS flatten state dict (#65) · 4f597233

Benjamin Lefaudeux authored Sep 09, 2020

Changes the structure of the returned state dict with respect to the param_groups to make it closer to what a vanilla optimizer would return (un-shard them). Shard again when loading

4f597233

08 Sep, 2020 1 commit

[feat] OSS: Sync all attributes (#67) · 5a268b25

Benjamin Lefaudeux authored Sep 08, 2020

Make sure that all attributes (not just LR) are in sync in between the OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a scheduled basis, which proves useful depending on the optimizer, so the keys need to be generically supported (not just "lr"). Not syncing these attributes is a worst case scenario, since these adjustments are silently not propagated, fixing that.

5a268b25

03 Sep, 2020 1 commit

[fix] OSS pytorch-compliant state dict (#61) · 1d1d15ea

Benjamin Lefaudeux authored Sep 03, 2020

* Aligning the optimizer state dict with what PyTorch expects

* Adding a check on the dict keys, ensure that `state` and `param_groups` are there

* after installing the specific isort, black and all, one liner to please the linter..

1d1d15ea