Commits · 2108f20e3e5f1054c18c5c7e20fed2f36a233349 · OpenDAS / fairscale

28 Oct, 2020 1 commit
- [refactor] moe: use all_to_all_single (#168) · 2108f20e
  msbaines authored Oct 27, 2020
  
  2108f20e
26 Oct, 2020 1 commit
- [chore] consolidate tool's config (#150) · c5e5ff78
  Min Xu authored Oct 26, 2020
  
  c5e5ff78
23 Oct, 2020 3 commits
- [feat][minor] OSS Benchmark - add a debug option to add some tensor dumps (#166) · 34f35fba
  Benjamin Lefaudeux authored Oct 23, 2020
```
* Some ease of use in the benchmark tool, add a debug option
```
  34f35fba
- [refactor] OSS - broadcasts - getting rid of the while loop (#165) · a31b08a5
  Benjamin Lefaudeux authored Oct 23, 2020
```
* small refactor, getting rid of the while loop
```
  a31b08a5
- [feat] moe: add support for multiple experts per device (#161) · 339cf060
  msbaines authored Oct 23, 2020
  
  339cf060
22 Oct, 2020 3 commits
- fix index links · 95ddbc19
  Vittorio Caggiano authored Oct 22, 2020
  
  95ddbc19
- Update index.rst · efffd7ed
  Vittorio Caggiano authored Oct 22, 2020
```
fix broken link
```
  efffd7ed
- [bugfix] hotfix oss benchmark regression testing (#163) · 6be7f973
  Benjamin Lefaudeux authored Oct 21, 2020
  
  6be7f973
21 Oct, 2020 7 commits

[fix] fixing adascale all_reduce (#155) · 6802ad49

Min Xu authored Oct 21, 2020

- Aurick noticed this bug and I ran into it yesterday
- after the fix, our cifar training now shows the same gain values from
  different replicas:

```
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
```

6802ad49
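
The check described in the commit body above (every replica reporting the same gain ratio) can be automated against a debug log. The following is a minimal sketch, assuming the log line format shown above; the helper names `gains_by_rank` and `replicas_agree` are hypothetical and not part of fairscale.

```python
import re
from collections import defaultdict

# Matches lines such as "... rank1 - scale 2, gain ratio 1.3512124098087777".
GAIN_RE = re.compile(r"rank(\d+) - scale (\d+), gain ratio ([0-9.]+)")


def gains_by_rank(log_text):
    """Collect, in log order, the gain ratios reported by each rank."""
    gains = defaultdict(list)
    for line in log_text.splitlines():
        match = GAIN_RE.search(line)
        if match:
            rank, _scale, gain = match.groups()
            gains[int(rank)].append(float(gain))
    return dict(gains)


def replicas_agree(log_text):
    """Return True if every rank logged the same sequence of gain ratios."""
    sequences = list(gains_by_rank(log_text).values())
    return bool(sequences) and all(seq == sequences[0] for seq in sequences)


# Gain lines copied from the log excerpt in the commit message.
SAMPLE_LOG = """\
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
"""

if __name__ == "__main__":
    print(replicas_agree(SAMPLE_LOG))
```

Comparing whole per-rank sequences, rather than pairing adjacent lines, keeps the check independent of how the ranks' output happens to interleave in the log.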

[feature] OSS: Use MNIST to benchmark (#159) · 6f8a8652

Benjamin Lefaudeux authored Oct 21, 2020

* switching to MNIST
* updating the reference values, should be good to go
* download dataset once for all processes

6f8a8652

- Update index.rst · 577dcd98
  Vittorio Caggiano authored Oct 21, 2020
```
fix max depth
```
  577dcd98
- Update index.rst · eb2cabdc
  Vittorio Caggiano authored Oct 21, 2020
```
fix maxdepth
```
  eb2cabdc
- Update index.rst · 16b50272
  Vittorio Caggiano authored Oct 21, 2020
  
  16b50272

Classification Examples of oss + pipe | tutorials/doc update (#119) · 53043d26

Vittorio Caggiano authored Oct 21, 2020

* wip_example
* [wip]mnist_pipe_example
* [wip]mnist_pipe_example
* [wip]mnist_pipe_example
* [wip]mnist_pipe_example
* [wip]mnist_oss_example
* working prototype
* added tutorial script
* update tutorial
* Update mnist_test_oss.py
* Update mnist_test_oss.py
* Update mnist_test_oss.py
* Update mnist_test_pipe.py
* Update tutorial_oss.py
* Update tutorial_pipe.py
* Update tutorial_pipe.py
* Update mnist_test_oss.py
* Update tutorial_pipe.py
* Update mnist_test_pipe.py
* Update tutorial_pipe.py
* fix black
* fix flake8
* general fixes
* add example oss+pipe
* fix isort
* Update mnist_test_pipe.py
* fix black

Co-authored-by: Vittorio Caggiano <caggiano@devfair0253.h2.fair>

53043d26

- [test] moe: add a more thorough MOELayer routing test (#151) · c6d9be79
  msbaines authored Oct 20, 2020
  
  c6d9be79

20 Oct, 2020 4 commits
- [feat][minor] OSS benchmark - pick the model via args (#152) · 49a3d9bc
  Benjamin Lefaudeux authored Oct 20, 2020
```
* Minor quality-of-life change: eases debugging and makes it possible to test a host of models with the same code
```
  49a3d9bc
- [refactor][minor] OSS - small refactor of the bucketing (#153) · 61bb32b5
  Benjamin Lefaudeux authored Oct 20, 2020
```
* small refactor, code cleanup
* broadcast tensor .data attribute directly
```
  61bb32b5
- [test] fine tune test for checkpoint & DDP (#148) · 66b2b514
  Min Xu authored Oct 20, 2020
```
- fixed typing
- make it run less often to reduce CI time

testing: ran it in a loop to make sure it runs at the right frequency.
```
  66b2b514
- [cleanup] mypy adascale (#149) · a0042113
  Min Xu authored Oct 20, 2020
```
- close #143
```
  a0042113
18 Oct, 2020 1 commit
- [docs][minor] fixing the readme example for oss (#147) · 58e97aa6
  Benjamin Lefaudeux authored Oct 17, 2020
```
* fixing the readme for oss
```
  58e97aa6
17 Oct, 2020 2 commits
- [feat][minor] OSS: benchmark - adding a cpu option (#144) · 10062e58
  Benjamin Lefaudeux authored Oct 16, 2020
```
* adding a cpu option
* adjust the reference loss
```
  10062e58
- [cleanup] moe: rename moelayer.py to moe_layer.py (#141) · 61234360
  msbaines authored Oct 16, 2020
  
  61234360
16 Oct, 2020 4 commits
- [fix] fixing circleCI for AdaScale (#142) · a65fc83e
  Min Xu authored Oct 16, 2020
```
* [fix] fixing circleCI for AdaScale

- ran black, isort, flake8, mypy

* more fix
```
  a65fc83e
- [feat] Add implementation of AdaScale (#139) · 64d1e312
  Aurick Qiao authored Oct 16, 2020
```
* Add implementation of AdaScale

* add adascale docs
```
  64d1e312
- [feat] moe: annotate expert params (#140) · ee88bb19
  msbaines authored Oct 16, 2020
```
The expert annotation is used by clip_grads and DDP.
```
  ee88bb19
- [feat] moe: add all_to_all backward support (#137) · d99c445a
  msbaines authored Oct 16, 2020
  
  d99c445a
15 Oct, 2020 1 commit
- [chore] create v0.0.3 (#138) · 1e6c547a
  msbaines authored Oct 14, 2020
  
  1e6c547a
14 Oct, 2020 3 commits
- [feat] OSS: adding a --profile option to the benchmark (#135) · 34915bf8
  Benjamin Lefaudeux authored Oct 14, 2020
  
  34915bf8
- [bugfix] OSS + Apex (#136) · 37c686e7
  Benjamin Lefaudeux authored Oct 14, 2020
```
* fixing the issue wrt Apex; validated with Latte, Classy would need another pass
```
  37c686e7
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
  6d802f5a
10 Oct, 2020 1 commit
- [bugfix] OSS no reduce loss (#133) · 177151e0
  Benjamin Lefaudeux authored Oct 09, 2020
```
* bugfix
* adjust default non-regression loss, not all_reduced now
```
  177151e0
09 Oct, 2020 2 commits
- [minor] OSS doc fix - add the DDP wrap (#131) · 5220f89b
  Benjamin Lefaudeux authored Oct 09, 2020
```
* wrapping the model in DDP in the tutorial

* typo
```
  5220f89b
- [minor] OSS: bring DDP in the benchmark (#130) · bfd88cad
  Benjamin Lefaudeux authored Oct 08, 2020
```
More realistic benchmarks, comparing apples to apples. DDP/OSS+DDP/OSS+SDP
```
  bfd88cad
08 Oct, 2020 4 commits

- [fix] OSS unit test to check data group (#129) · 81ac5b28
  Benjamin Lefaudeux authored Oct 08, 2020
```
* new unit test to catch rank issues in OSS
```
  81ac5b28
- [feat] moe: initial implementation of MOELayer (#128) · 22ff665d
  msbaines authored Oct 08, 2020
```
Currently only implemented for a single process and expert.
```
  22ff665d
- [fix] megatron + oss (#127) · 82dbd5d8
  ngoyal2707 authored Oct 08, 2020
```
authored-by: Naman Goyal <namangoyal@learnfair0755.h2.fair>
```
  82dbd5d8

[test] Add unittest for checkpoint & DDP (#126) · 6658be22

Min Xu authored Oct 07, 2020

* Add unittest for checkpoint & DDP

- this change adds test cases to reproduce the error with checkpoint & DDP
- mandeep mentioned that there is also a deadlock in this case, but this
  change doesn't cover that.
- we cover cases where weight sharing is OK
- however, checkpointing the same module multiple times and
  find_unused_parameters are both not OK

* added norm checks

6658be22

06 Oct, 2020 2 commits

[feat] OSS/SDP : bucketing (#122) · 341d8b2b

Benjamin Lefaudeux authored Oct 05, 2020

Same bucketing strategy for OSS and SDP:
sort everything ahead of time, per rank and per size, smaller tensors first. Bucket the smallest elements in a fixed buffer and send it asynchronously, then send all the others asynchronously, and get back to the bucket. Once done, scatter the contents if needed.

341d8b2b
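
The ordering and bucketing step described in the commit body above can be sketched in plain Python. This is only an illustration of the planning logic under stated assumptions, not the fairscale implementation: the function name `plan_broadcasts`, the `(name, owner_rank, numel)` tuples and the `bucket_capacity` parameter are hypothetical, and the actual asynchronous sends and the final scatter are left out.

```python
from collections import defaultdict


def plan_broadcasts(params, bucket_capacity):
    """Split each rank's tensors into a bucketed group and a direct group.

    params: iterable of (name, owner_rank, numel) tuples (hypothetical layout).
    bucket_capacity: number of elements the fixed buffer can hold.
    Returns {rank: (bucketed_names, direct_names)}.
    """
    # Group the tensors by the rank that owns them.
    per_rank = defaultdict(list)
    for name, rank, numel in params:
        per_rank[rank].append((name, numel))

    plan = {}
    for rank, items in per_rank.items():
        # Sort ahead of time, smaller tensors first.
        items.sort(key=lambda item: item[1])

        bucketed, direct, used = [], [], 0
        for name, numel in items:
            if used + numel <= bucket_capacity:
                # Small tensors are packed into the fixed buffer.
                bucketed.append(name)
                used += numel
            else:
                # Because sizes are ascending, everything from here on
                # is too large for the remaining space and is sent alone.
                direct.append(name)
        plan[rank] = (bucketed, direct)
    return plan


if __name__ == "__main__":
    example = [("a", 0, 10), ("b", 0, 2), ("c", 0, 5), ("d", 1, 100), ("e", 1, 1)]
    print(plan_broadcasts(example, bucket_capacity=8))
```

In a real setting, the bucketed group would be copied into one contiguous buffer and sent with a single asynchronous call, the direct group would be sent tensor by tensor, and the buffer contents would be copied back to the individual tensors once the bucket transfer completes.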

- [refactor] moe: simplify logic removing top expert (#125) · 6e7ad798
  msbaines authored Oct 05, 2020
  
  6e7ad798

05 Oct, 2020 1 commit
- [fix] moe: fix Top2Gate to work on GPU (#124) · 662667d0
  msbaines authored Oct 05, 2020
  
  662667d0