Commits · 543d56935ee39c9b288d78f577eae8a2199bca7d · OpenDAS / fairscale


28 Oct, 2020 1 commit
- [refactor] moe: use all_to_all_single (#168) · 2108f20e
  msbaines authored Oct 27, 2020
  
23 Oct, 2020 1 commit
- [refactor] OSS - broadcasts - getting rid of the while loop (#165) · a31b08a5
  Benjamin Lefaudeux authored Oct 23, 2020
```
* small refactor, getting rid of the while loop
```
21 Oct, 2020 1 commit

[fix] fixing adascale all_reduce (#155) · 6802ad49

Min Xu authored Oct 21, 2020

- Aurick noticed this bug and I ran into it yesterday
- after the fix, our CIFAR training shows the same gain values across
  different replicas:

```
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
```
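The agreement between replicas can also be checked mechanically rather than by eye. The following is a minimal sketch that assumes the log line format shown above; the names `GAIN_RE`, `gains_by_rank`, and `replicas_agree` are made up for this illustration and are not part of fairscale.

```python
import re
from collections import defaultdict

# Gain lines copied from the debug log above (timing lines omitted).
LOG = """\
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
"""

# Assumed line format: "... rank<N> - scale <S>, gain ratio <float>"
GAIN_RE = re.compile(r"rank(\d+) - scale \d+, gain ratio ([0-9.]+)")


def gains_by_rank(log_text):
    """Collect the gain ratios per rank, in the order they were logged."""
    gains = defaultdict(list)
    for line in log_text.splitlines():
        match = GAIN_RE.search(line)
        if match:
            # Keep the value as text so the comparison is exact.
            gains[int(match.group(1))].append(match.group(2))
    return dict(gains)


def replicas_agree(log_text):
    """Return True if every rank logged the same sequence of gain ratios."""
    per_rank = list(gains_by_rank(log_text).values())
    return all(seq == per_rank[0] for seq in per_rank)


if __name__ == "__main__":
    print(replicas_agree(LOG))
```

Comparing the logged strings instead of parsed floats keeps the check strict: any divergence in the printed digits between ranks is reported as a mismatch.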


20 Oct, 2020 2 commits
- [test] fine tune test for checkpoint & DDP (#148) · 66b2b514
  Min Xu authored Oct 20, 2020
```
- fixed typing
- make it run less often to reduce CI time

testing: ran it in a loop to make sure it runs at the right frequency.
```
- [cleanup] mypy adascale (#149) · a0042113
  Min Xu authored Oct 20, 2020
```
- close #143
```
14 Oct, 2020 1 commit
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
02 Oct, 2020 1 commit
- [feat] moe: initial implementation of Top2Gating (#118) · 7815f6f3
  msbaines authored Oct 01, 2020
  
17 Sep, 2020 1 commit

Multi-process pipe (#90) · 63f7796a

Tom Birch authored Sep 17, 2020

Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines):
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess but also supporting PipelineStyle.MultiProcess
* Adds support for lazy construction of modules (see lazy_construction for an example)
* Adds two implementations of inter-process communication: one based on RPC with globally visible queues, one based on send/recv
* Copies all the relevant tests from tests/pipe to tests/pipe_process and modifies them to exercise PipelineStyle.MultiProcess
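
To illustrate how a style argument and lazy construction might fit together, here is a minimal, self-contained sketch. It does not use fairscale and is not the implementation from this commit: only the names PipelineStyle, SingleProcess, and MultiProcess come from the message above, while `PipeSketch`, `stage_factories`, `rank`, and `forward_local` are hypothetical. The sketch assumes that in the multi-process style each process builds only the stage it owns.

```python
from enum import Enum, auto


class PipelineStyle(Enum):
    # Member names taken from the commit message; the values are arbitrary.
    SingleProcess = auto()
    MultiProcess = auto()


class PipeSketch:
    """Hypothetical stand-in for a pipeline wrapper, not fairscale's Pipe."""

    def __init__(self, stage_factories, style=PipelineStyle.SingleProcess, rank=0):
        self.style = style
        if style is PipelineStyle.SingleProcess:
            owned = range(len(stage_factories))  # one process runs every stage
        else:
            owned = [rank]  # assumption: one stage per process
        # Lazy construction: a factory is only called for stages this
        # process owns, so other stages are never built here.
        self.stages = {i: stage_factories[i]() for i in owned}

    def forward_local(self, x):
        """Apply the locally owned stages in pipeline order."""
        for i in sorted(self.stages):
            x = self.stages[i](x)
        return x


if __name__ == "__main__":
    factories = [lambda: (lambda x: x + 1), lambda: (lambda x: x * 2)]
    print(PipeSketch(factories).forward_local(3))  # (3 + 1) * 2 = 8
    multi = PipeSketch(factories, style=PipelineStyle.MultiProcess, rank=1)
    print(multi.forward_local(3))  # only stage 1 runs here: 3 * 2 = 6
```

Passing factories instead of built modules is what makes the per-process selection possible: a process never pays the construction cost for stages it does not run. Passing activations between processes (the RPC or send/recv paths mentioned above) is left out of this sketch.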


16 Sep, 2020 1 commit
- [cleanup] fix pre-commit mypy issues (#87) · 4a874a6b
  msbaines authored Sep 16, 2020
  
03 Sep, 2020 1 commit

Add grad scaler (#48) · b6a5e634

Jun Ru Anderson authored Sep 03, 2020



Add GradScaler to fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark. It is not needed there, but it serves as an example of how to use gradient scaling for larger models that require it in order to converge.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
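
For readers unfamiliar with gradient scaling, the idea can be sketched in plain Python. This is an illustration of dynamic loss scaling in general, not the code added by this commit and not PyTorch's implementation. The class `LossScalerSketch` and its methods are hypothetical, and the default values are chosen to mirror the documented defaults of PyTorch's `torch.cuda.amp.GradScaler`.

```python
import math


class LossScalerSketch:
    """Hypothetical dynamic loss scaler operating on plain floats."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale_value = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0
        self._found_inf = False

    def scale(self, loss):
        """Multiply the loss so that small gradients do not underflow."""
        return loss * self.scale_value

    def step(self, scaled_grads):
        """Unscale gradients; return None if the step should be skipped."""
        unscaled = [g / self.scale_value for g in scaled_grads]
        self._found_inf = not all(math.isfinite(g) for g in unscaled)
        return None if self._found_inf else unscaled

    def update(self):
        """Shrink the scale after an overflow, grow it after a clean run."""
        if self._found_inf:
            self.scale_value *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale_value *= self.growth_factor
                self._good_steps = 0


if __name__ == "__main__":
    scaler = LossScalerSketch(init_scale=4.0, growth_interval=2)
    print(scaler.scale(1.5))        # 1.5 * 4.0 = 6.0
    print(scaler.step([8.0, 4.0]))  # unscaled: [2.0, 1.0]
    scaler.update()
```

A training loop would call `scale` on the loss before the backward pass, `step` in place of a direct optimizer step, and `update` once per iteration.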


27 Aug, 2020 1 commit

[fix] optim/oss: fix state cast (#56) · fb49b515

msbaines authored Aug 27, 2020

Work around a PyTorch bug that casts state (pytorch/pytorch#43706).

Copied from https://github.com/pytorch/fairseq/blob/v0.9.0/fairseq/optim/fp16_optimizer.py#L251-L268


14 Aug, 2020 2 commits
- [cleanup] get 100% coverage on oss.py (#38) · 3427a039
  msbaines authored Aug 13, 2020
```
authored-by: Mandeep Singh Baines <msb@fb.com>
```
- [test] using PyTorch v1.6 for Lint checks (#36) · b35a3d3f
  msbaines authored Aug 13, 2020
  
31 Jul, 2020 3 commits
- [doc] Fix copyright in stubs (#13) · e10de4b5
  msbaines authored Jul 30, 2020
  
- [feat] Model parallel (#3) · 30f5009a
  Tom Birch authored Jul 22, 2020
  
- [fix] add TransformerEncoderLayer to stubs (#5) · 63b5b166
  Jun Ru Anderson authored Jul 21, 2020
```
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
```
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  