Commits · 4d8f2e59b5a8ac78e514d736a18c9eeae22b8014 · OpenDAS / fairscale

01 Dec, 2020 1 commit
- [fix][Pipe] fallback for Pipe tests on internal pytorch numbering (#216) · 4d8f2e59
  Benjamin Lefaudeux authored Nov 30, 2020
```
* fallback on internal pytorch numbering
```
  4d8f2e59
30 Nov, 2020 1 commit
- [fix] OSS ad-hoc perf regression fix, more inconsistent than expected (#214) · 835ecb0c
  Benjamin Lefaudeux authored Nov 30, 2020
  
  835ecb0c
27 Nov, 2020 1 commit
- [doc] Fixing relative html links (#212) · d09f5aa2
  Benjamin Lefaudeux authored Nov 26, 2020
```
Fixing the relative positions of the html docs.
```
  d09f5aa2
26 Nov, 2020 1 commit
- [fix] Adding a GradScaler import guard for amp with pytorch 1.5 (#210) · 8e85ce8c
  Benjamin Lefaudeux authored Nov 25, 2020
  
  8e85ce8c
24 Nov, 2020 1 commit
- [doc] make the basic example more "compilable" (#207) · 7a062894
  Stas Bekman authored Nov 24, 2020
```
* make the basic example usable out of the box
* clarify
```
  7a062894
22 Nov, 2020 1 commit

[fix] More robust stats for regression testing (#204) · 2b121242

Benjamin Lefaudeux authored Nov 22, 2020

* testing median and MAD

* synchronize on kernels to make sure that we're measuring the actual completion time

* adjusting the circleci threshold, not that the speed has regressed but because we measure proper cuda execution time

2b121242

21 Nov, 2020 1 commit

[feat] ShardedDataParallel with autoreduce (#157) · ad933b34

Benjamin Lefaudeux authored Nov 21, 2020

* rewrite using autograd and Variable execution queue to make the reduce automatic
* share buckets with OSS to remove duplication
* some speed still likely on the table since the speed vs. bucketing does not match expectations, could be a follow up

ad933b34

20 Nov, 2020 1 commit
- [fix] make fairscale.utils a proper package (#200) · 35d4129f
  msbaines authored Nov 19, 2020
  
  35d4129f
19 Nov, 2020 4 commits
- [fix] make extension build robust to include path (#196) · 3b83ef51
  msbaines authored Nov 19, 2020
```
Fixes #190
```
  3b83ef51
- [test] run moe mpi tests using torch_pg (#197) · cd496b36
  msbaines authored Nov 19, 2020
  
  cd496b36
- [fix] Reverting a change which slipped in #188 (#198) · ba367d39
  Benjamin Lefaudeux authored Nov 18, 2020
```
* reverting a change which slipped in #188
```
  ba367d39
- [feat] Add CPU support for pipe.py benchmarks (#188) · a842a927
  Yuanyuan (Ana) Shen authored Nov 18, 2020
```
* Add CPU support for pipe.py benchmarks, CUDA-free
```
  a842a927
18 Nov, 2020 2 commits
- fix bug (#193) · f80b303c
  Tom Birch authored Nov 17, 2020
  
  f80b303c
- [feat] ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182) · d85acf72
  Benjamin Lefaudeux authored Nov 17, 2020
```
* adding a shard-aware GradScaler wrap, credits to Sean Naren for the idea
* adding stubs & explanations in the documentation
```
  d85acf72
17 Nov, 2020 1 commit

[doc] add AdaScale API doc (#191) · 587b707d

Min Xu authored Nov 17, 2020

- removed experimental warning as we have validated it on cifar and
imagenet, transformer is looking good so far too.
- fixed API doc formatting
- make it consistent with the other code in the repo
- tested by making the doc locally and inspect the results

587b707d

16 Nov, 2020 1 commit
- [feat] OSS-aware clip grads, bridge sharded states (#167) · ade312c4
  Benjamin Lefaudeux authored Nov 16, 2020
```
add a clip gradients util, equivalent to torch's but aware of the sharded states. Add a corresponding unit test
```
  ade312c4
12 Nov, 2020 2 commits
- [fix] Pure cpu support for benchmarks/oss.py (#185) · 2fe93203
  Yuanyuan (Ana) Shen authored Nov 12, 2020
```
* now works on a machine without cuda, easier to debug and quick test
```
  2fe93203
- [refactor] moe: cleanup code to be more readable (#186) · 34df6069
  msbaines authored Nov 11, 2020
  
  34df6069
11 Nov, 2020 2 commits
- [fix] moe: fix bug for multiple experts per-gpu case (#184) · 317c0945
  msbaines authored Nov 11, 2020
  
  317c0945
- [refactor] moe: remove G dimension (#183) · 89176e34
  msbaines authored Nov 11, 2020
  
  89176e34
10 Nov, 2020 1 commit

Single-process control via PipeRPCWrapper (#156) · 5d4f50fb

Tom Birch authored Nov 10, 2020

Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncScheudle, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive

Also added examples for multi-process and PipeRPCWrapper

5d4f50fb

06 Nov, 2020 2 commits
- [fix] OSS tests - remove concurrent dist inits (#177) · 543d5693
  Benjamin Lefaudeux authored Nov 06, 2020
  
  543d5693
- [feature] Add a torch AMP benchmark option and test job (#175) · cc766aa5
  Benjamin Lefaudeux authored Nov 05, 2020
```
* oss benchmark: add an --amp option
* add a circleCI test
```
  cc766aa5
04 Nov, 2020 1 commit
- [feat] oss: add rank_local_state_dict staticmethod (#174) · 0d1f058b
  msbaines authored Nov 04, 2020
  
  0d1f058b
30 Oct, 2020 2 commits
- add warning to adascale before it is validated (#169) · b5ccedc0
  Min Xu authored Oct 30, 2020
  
  b5ccedc0
- [chore] add circleci testing of torch==1.5.1 (#172) · 4247f602
  msbaines authored Oct 29, 2020
  
  4247f602
29 Oct, 2020 1 commit
- [chore] update to torch v1.7.0 (#171) · ace61a41
  msbaines authored Oct 28, 2020
  
  ace61a41
28 Oct, 2020 2 commits
- [chore] update isort to 5.6.4 (#170) · ea9876e3
  msbaines authored Oct 27, 2020
  
  ea9876e3
- [refactor] moe: use all_to_all_single (#168) · 2108f20e
  msbaines authored Oct 27, 2020
  
  2108f20e
26 Oct, 2020 1 commit
- [chore] consolidate tool's config (#150) · c5e5ff78
  Min Xu authored Oct 26, 2020
  
  c5e5ff78
23 Oct, 2020 3 commits
- [feat][minor] OSS Benchmark - add a debug option to add some tensor dumps (#166) · 34f35fba
  Benjamin Lefaudeux authored Oct 23, 2020
```
* Some ease of use in the benchmark tool, add a debug option
```
  34f35fba
- [refactor] OSS - broadcasts - getting rid of the while loop (#165) · a31b08a5
  Benjamin Lefaudeux authored Oct 23, 2020
```
* small refactor, getting rid of the while loop
```
  a31b08a5
- [feat] moe: add support for multiple experts per device (#161) · 339cf060
  msbaines authored Oct 23, 2020
  
  339cf060
22 Oct, 2020 3 commits
- fix index links · 95ddbc19
  Vittorio Caggiano authored Oct 22, 2020
  
  95ddbc19
- Update index.rst · efffd7ed
  Vittorio Caggiano authored Oct 22, 2020
```
fix broken link
```
  efffd7ed
- [bugfix] hotfix oss benchmark regression testing (#163) · 6be7f973
  Benjamin Lefaudeux authored Oct 21, 2020
  
  6be7f973
21 Oct, 2020 4 commits

[fix] fixing adascale all_reduce (#155) · 6802ad49

Min Xu authored Oct 21, 2020

- Aurick noticed this bug and I ran into it yesterday
- after the fix, our cifar training shows same gain values from
  different replics now:

```
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
```

6802ad49

[feature] OSS: Use MNIST to benchmark (#159) · 6f8a8652

Benjamin Lefaudeux authored Oct 21, 2020

* switching to MNIST
* updating the reference values, should be good to go
* download dataset once for all processes

6f8a8652

Update index.rst · 577dcd98
Vittorio Caggiano authored Oct 21, 2020
```
fix max depth
```
577dcd98
Update index.rst · eb2cabdc
Vittorio Caggiano authored Oct 21, 2020
```
fix maxdepth
```
eb2cabdc