- 20 Oct, 2020 3 commits
Benjamin Lefaudeux authored
* small refactor, code cleanup
* broadcast the tensor .data attribute directly
Min Xu authored
- fixed typing
- made it run less often to reduce CI time
- testing: run it in a loop to make sure it is run at the right frequency
Min Xu authored
- close #143
- 18 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* fixing the readme for oss
- 17 Oct, 2020 2 commits
Benjamin Lefaudeux authored
* adding a cpu option
* adjust the reference loss
msbaines authored
- 16 Oct, 2020 4 commits
Min Xu authored
* [fix] fixing CircleCI for AdaScale - ran black, isort, flake8, mypy
* more fixes
Aurick Qiao authored
* Add implementation of AdaScale
* add AdaScale docs
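AdaScale adjusts the learning-rate gain when training with S data-parallel replicas, based on the variance of the per-replica gradients. The sketch below is a simplified, dependency-free illustration of the gain formula from the AdaScale SGD paper, not fairscale's actual implementation; the function name and list-based gradients are illustrative.

```python
def adascale_gain(per_replica_grads):
    """Estimate the AdaScale gain r in [1, S] from S per-replica gradients.

    per_replica_grads: list of S gradient vectors (lists of floats).
    Simplified sketch: r = (var + |mean|^2) / (var/S + |mean|^2).
    """
    S = len(per_replica_grads)
    dim = len(per_replica_grads[0])
    # Averaged (large-batch) gradient across replicas.
    avg = [sum(g[d] for g in per_replica_grads) / S for d in range(dim)]
    mu_sq = sum(a * a for a in avg)  # squared norm of the mean gradient
    # Mean squared norm of a single replica's gradient.
    single_sq = sum(sum(x * x for x in g) for g in per_replica_grads) / S
    # Variance estimate: E|g|^2 - |E g|^2 (clamped for numerical safety).
    var = max(single_sq - mu_sq, 0.0)
    denom = var / S + mu_sq
    return (var + mu_sq) / denom if denom > 0 else 1.0
```

When all replicas agree (zero variance) the gain is 1; when the gradients are pure noise (zero mean) the gain reaches S, i.e. the full linear-scaling speedup.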
msbaines authored
The expert annotation is used by clip_grads and DDP.
msbaines authored
- 15 Oct, 2020 1 commit
msbaines authored
- 14 Oct, 2020 3 commits
Benjamin Lefaudeux authored
Benjamin Lefaudeux authored
* fixing the issue w.r.t. Apex; validated with Latte, but Classy would need another pass
msbaines authored
- 10 Oct, 2020 1 commit
Benjamin Lefaudeux authored
* bugfix
* adjust the default non-regression loss, not all_reduced now
- 09 Oct, 2020 2 commits
Benjamin Lefaudeux authored
* wrapping the model in DDP in the tutorial
* typo fix
Benjamin Lefaudeux authored
More realistic benchmarks, comparing apples to apples: DDP / OSS+DDP / OSS+SDP
- 08 Oct, 2020 4 commits
Benjamin Lefaudeux authored
* new unit test to catch rank issues in OSS
msbaines authored
Currently only implemented for a single process and expert.
ngoyal2707 authored
Authored-by: Naman Goyal <namangoyal@learnfair0755.h2.fair>
Min Xu authored
* Add unit test for checkpoint & DDP
- this change adds test cases to reproduce the error with checkpoint & DDP
- Mandeep mentioned that there is also a deadlock in this case, but this change doesn't cover that
- we cover cases where weight sharing is OK
- however, using the same module in multiple checkpoints, or find_unused_parameters, are both not OK
* added norm checks
- 06 Oct, 2020 2 commits
Benjamin Lefaudeux authored
Same bucketing strategy for OSS and SDP: sort everything ahead of time, per rank and per size, smaller tensors first. Pack the smallest elements into a fixed buffer and send it async, then send all the others async, and get back to the bucket. Once done, scatter the contents if needed.
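The size-sorted bucketing described above can be sketched as a planning step: sort tensors by size, pack the smallest ones into a fixed-capacity buffer, and send everything that doesn't fit individually. This is an assumed simplification for illustration, not fairscale's actual code; the function and variable names are hypothetical.

```python
def plan_buckets(tensor_sizes, bucket_capacity):
    """Split tensor indices into (bucketed, direct) send lists.

    tensor_sizes: per-tensor element counts; bucket_capacity: fixed buffer size.
    """
    # Smallest tensors first, so many tiny broadcasts collapse into one.
    order = sorted(range(len(tensor_sizes)), key=lambda i: tensor_sizes[i])
    bucketed, direct, used = [], [], 0
    for i in order:
        if used + tensor_sizes[i] <= bucket_capacity:
            bucketed.append(i)   # copied into the fixed buffer, sent as one op
            used += tensor_sizes[i]
        else:
            direct.append(i)     # too big for the buffer, sent async on its own
    return bucketed, direct
```

Sorting deterministically (per rank and per size) is what keeps all ranks agreeing on the same plan without extra coordination.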
msbaines authored
- 05 Oct, 2020 1 commit
msbaines authored
- 02 Oct, 2020 1 commit
msbaines authored
- 01 Oct, 2020 3 commits
msbaines authored
Joshua Meier authored
support optimizer state sharding for megatron
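Optimizer state sharding splits the optimizer's per-parameter state across ranks so each rank only holds a fraction of it. A common way to assign parameters to ranks is greedy least-loaded bin packing; the sketch below illustrates that idea under assumed, illustrative names, and is not fairscale's actual partitioning code.

```python
def shard_params(param_sizes, world_size):
    """Assign parameter indices to ranks, balancing total element count.

    Returns a list of per-rank lists of parameter indices.
    """
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    # Largest parameters first keeps the greedy split close to balanced.
    for idx in sorted(range(len(param_sizes)), key=lambda i: -param_sizes[i]):
        rank = loads.index(min(loads))  # least-loaded rank so far
        shards[rank].append(idx)
        loads[rank] += param_sizes[idx]
    return shards
```

Each rank then builds its local optimizer over only its shard, and the updated parameters are broadcast back to the other ranks after each step.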
Benjamin Lefaudeux authored
* minor, but gives some memory back
* adjust CI and regression checks to 4 GPUs
- 29 Sep, 2020 1 commit
Benjamin Lefaudeux authored
- adding the buffer broadcast option
- minor cleanup in ShardedDDP
- 24 Sep, 2020 3 commits
Vittorio Caggiano authored
add badges and a link to readthedocs
Vittorio Caggiano authored
Benjamin Lefaudeux authored
- small benchmark refactor, only one for all backends and DDP
- deterministic, enforce alignment with PyTorch DDP
- 22 Sep, 2020 3 commits
Benjamin Lefaudeux authored
* various fixes, no more issues with `make html` and more API fields should be populated
Benjamin Lefaudeux authored
* Broadcasting grad-enabled tensors is forbidden in Gloo, because broadcast is not differentiable. Added a workaround.
Benjamin Lefaudeux authored
* Doc extensions to some APIs
* Fix the benchmark and tutorial
- 17 Sep, 2020 5 commits
Tom Birch authored
Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines).
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
* Added support for lazy construction of modules (see lazy_construction for an example)
* Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
* Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
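The lazy-construction idea above can be sketched without any torch dependency: each stage is given as a zero-argument factory, and under the multi-process style a process only calls the factories for the stages it owns, so no process pays the memory cost of building the whole model. Names (`PipelineStyle`, `materialize_stages`, the round-robin assignment) are illustrative assumptions, not the actual Pipe API.

```python
from enum import Enum, auto

class PipelineStyle(Enum):
    SingleProcess = auto()
    MultiProcess = auto()

def materialize_stages(stage_factories, style, rank=0, world_size=1):
    """Build only the pipeline stages this process is responsible for.

    stage_factories: list of zero-argument callables, one per stage.
    """
    if style is PipelineStyle.SingleProcess:
        # Classic single-process pipeline: build every stage locally.
        return [f() for f in stage_factories]
    # MultiProcess: assume round-robin stage ownership; construct only ours.
    return [
        f() for i, f in enumerate(stage_factories) if i % world_size == rank
    ]
```

The factories defer construction until after process placement is known, which is exactly what makes multi-machine pipelines feasible for models too large for one host.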
Benjamin Lefaudeux authored
- rename oss_ddp to ShardedDataParallel
- some refactoring
- ShardedDataParallel owns the sharded optimizer, exposed if need be
- some small perf bumps
msbaines authored
Benjamin Lefaudeux authored
Benjamin Lefaudeux authored