- 28 Oct, 2020 1 commit
-
-
msbaines authored
-
- 26 Oct, 2020 1 commit
-
-
Min Xu authored
-
- 23 Oct, 2020 3 commits
-
-
Benjamin Lefaudeux authored
* Some ease of use in the benchmark tool, add a debug option
-
Benjamin Lefaudeux authored
* small refactor, getting rid of the while loop
-
msbaines authored
-
- 22 Oct, 2020 3 commits
-
-
Vittorio Caggiano authored
-
Vittorio Caggiano authored
fix broken link
-
Benjamin Lefaudeux authored
-
- 21 Oct, 2020 7 commits
-
-
Min Xu authored
- Aurick noticed this bug and I ran into it yesterday. After the fix, our CIFAR training now shows the same gain values from different replicas:
```
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
```
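The check described in this message (every replica reporting the same gain ratio) can be automated against a debug log. The following is only a minimal sketch, assuming the `rankN - scale S, gain ratio G` line format quoted above and assuming each rank logs one gain line per step, in step order. The names `GAIN_RE`, `gains_by_rank` and `replicas_agree` are hypothetical helpers, not part of the project.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed line format, taken from the log excerpt in the commit message:
#   "... - rank1 - scale 2, gain ratio 1.3512124098087777"
GAIN_RE = re.compile(r"rank(\d+) - scale \d+, gain ratio ([0-9.]+)")


def gains_by_rank(log_text: str) -> Dict[int, List[float]]:
    """Collect the gain ratios logged by each rank, in order of appearance."""
    gains: Dict[int, List[float]] = defaultdict(list)
    for rank, gain in GAIN_RE.findall(log_text):
        gains[int(rank)].append(float(gain))
    return dict(gains)


def replicas_agree(log_text: str) -> bool:
    """True when every rank logged exactly the same sequence of gain ratios."""
    series = list(gains_by_rank(log_text).values())
    return all(s == series[0] for s in series)
```

Timing lines are ignored because they do not match the pattern, so the full log can be passed in unfiltered.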
-
Benjamin Lefaudeux authored
* switching to MNIST
* updating the reference values, should be good to go
* download dataset once for all processes
-
Vittorio Caggiano authored
fix max depth
-
Vittorio Caggiano authored
fix maxdepth
-
Vittorio Caggiano authored
-
Vittorio Caggiano authored
* wip_example
* [wip]mnist_pipe_example
* [wip]mnist_pipe_example
* [wip]mnist_pipe_example
* [wip]mnist_pipe_example
* [wip]mnist_oss_example
* working prototype
* added tutorial script
* update tutorial
* Update mnist_test_oss.py
* Update mnist_test_oss.py
* Update mnist_test_oss.py
* Update mnist_test_pipe.py
* Update tutorial_oss.py
* Update tutorial_pipe.py
* Update tutorial_pipe.py
* Update mnist_test_oss.py
* Update tutorial_pipe.py
* Update mnist_test_pipe.py
* Update tutorial_pipe.py
* fix black
* fix flake8
* general fixes
* add example oss+pipe
* fix isort
* Update mnist_test_pipe.py
* fix black

Co-authored-by: Vittorio Caggiano <caggiano@devfair0253.h2.fair>
-
msbaines authored
-
- 20 Oct, 2020 4 commits
-
-
Benjamin Lefaudeux authored
* Minor quality-of-life change: easier to debug, and a host of models can now be tested with the same code
-
Benjamin Lefaudeux authored
* small refactor, code cleanup
* broadcast tensor .data attribute directly
-
Min Xu authored
- fixed typing
- make it run less often to reduce CI time
testing: ran it in a loop to make sure it runs at the right frequency.
-
Min Xu authored
- close #143
-
- 18 Oct, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* fixing the readme for oss
-
- 17 Oct, 2020 2 commits
-
-
Benjamin Lefaudeux authored
* adding a cpu option
* adjust the reference loss
-
msbaines authored
-
- 16 Oct, 2020 4 commits
-
-
Min Xu authored
* [fix] fixing circleCI for AdaScale
  - ran black, isort, flake8, mypy
* more fix
-
Aurick Qiao authored
* Add implementation of AdaScale
* add adascale docs
-
msbaines authored
The expert annotation is used by clip_grads and DDP.
-
msbaines authored
-
- 15 Oct, 2020 1 commit
-
-
msbaines authored
-
- 14 Oct, 2020 3 commits
-
-
Benjamin Lefaudeux authored
-
Benjamin Lefaudeux authored
* fixing the issue wrt Apex, validated with Latte, Classy would need another pass
-
msbaines authored
-
- 10 Oct, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* bugfix
* adjust default non-regression loss, not all_reduced now
-
- 09 Oct, 2020 2 commits
-
-
Benjamin Lefaudeux authored
* wrapping the model in DDP in the tutorial
* typo
-
Benjamin Lefaudeux authored
More realistic benchmarks, comparing apples to apples. DDP/OSS+DDP/OSS+SDP
-
- 08 Oct, 2020 4 commits
-
-
Benjamin Lefaudeux authored
* new unit test to catch rank issues in OSS
-
msbaines authored
Currently only implemented for a single process and expert.
-
ngoyal2707 authored
authored-by:Naman Goyal <namangoyal@learnfair0755.h2.fair>
-
Min Xu authored
* Add unittest for checkpoint & DDP
  - this change adds test cases to reproduce the error with checkpoint & DDP
  - mandeep mentioned that there is also a deadlock in this case, but this change doesn't cover that
  - we cover cases where weight sharing is OK
  - however, checkpointing the same module multiple times and find_unused_parameters are both not OK
* added norm checks
-
- 06 Oct, 2020 2 commits
-
-
Benjamin Lefaudeux authored
Same bucketing strategy for OSS and SDP: sort everything ahead of time, per rank and per size, smaller tensors first. Bucket the smallest elements in a fixed buffer and send it async, then send all the others async, and get back to the bucket. Once done, scatter the contents if needed.
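The ordering described in this message can be illustrated without any communication code. What follows is a minimal sketch in plain Python, not the project's implementation: `Param`, `plan_buckets` and `bucket_numel` are hypothetical names, tensors are reduced to an element count, and the async sends and the final scatter are left out. It only shows the planning step: group by owning rank, sort smaller tensors first, pack the smallest into one fixed-size buffer, and mark the rest for individual sends.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Param(NamedTuple):
    """Hypothetical stand-in for a tensor: only what the planner needs."""
    name: str
    rank: int   # rank that owns (and would send) this tensor
    numel: int  # number of elements


def plan_buckets(
    params: List[Param], bucket_numel: int
) -> Dict[int, Tuple[List[str], List[str]]]:
    """Per rank, return (names packed in the fixed buffer, names sent individually)."""
    per_rank: Dict[int, List[Param]] = defaultdict(list)
    for p in params:
        per_rank[p.rank].append(p)

    plan: Dict[int, Tuple[List[str], List[str]]] = {}
    for rank, owned in per_rank.items():
        owned.sort(key=lambda p: p.numel)  # smaller tensors first
        bucketed: List[str] = []
        direct: List[str] = []
        used = 0
        for p in owned:
            # Because the list is sorted, once a tensor does not fit,
            # no later (larger or equal) tensor fits either.
            if used + p.numel <= bucket_numel:
                bucketed.append(p.name)
                used += p.numel
            else:
                direct.append(p.name)
        plan[rank] = (bucketed, direct)
    return plan
```

Doing the sort once, ahead of time, means the per-step work is limited to copying into the buffer and issuing the sends in the precomputed order.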
-
msbaines authored
-
- 05 Oct, 2020 1 commit
-
-
msbaines authored
-