Commits · 1db8bbdae190b4601192fa0c74af1abe62e23643 · OpenDAS / fairscale

22 Nov, 2020 1 commit

[fix] More robust stats for regression testing (#204) · 2b121242

Benjamin Lefaudeux authored Nov 22, 2020

* testing median and MAD

* synchronize on kernels to make sure that we're measuring the actual completion time

* adjusting the CircleCI threshold: the speed has not regressed, but we now measure the actual CUDA execution time

2b121242
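
As an illustration of why median and MAD (median absolute deviation) are more robust than a mean for regression thresholds, here is a minimal sketch using only the Python standard library. The helper name `median_and_mad` is hypothetical and is not taken from the fairscale benchmark; the sample values are the six words-per-second runs quoted in the message of commit d1d74413 (#52) further down this log.

```python
import statistics


def median_and_mad(samples):
    """Return (median, median absolute deviation) for a list of numbers."""
    med = statistics.median(samples)
    # MAD: the median of the absolute distances to the median.
    mad = statistics.median([abs(x - med) for x in samples])
    return med, mad


# Words-per-second values quoted in commit d1d74413 (#52); one run is an outlier.
wps_runs = [9144.40, 9163.91, 9993.01, 9082.82, 9155.09, 9000.67]

med, mad = median_and_mad(wps_runs)
mean = statistics.mean(wps_runs)

# The single fast run pulls the mean upwards, while the median barely moves.
print(f"mean={mean:.2f} median={med:.2f} MAD={mad:.2f}")
```

A regression check built on these statistics could, for example, compare a new measurement against `med - k * mad` for some tolerance `k`; the choice of `k` is an assumption and is not specified by the commit.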

21 Nov, 2020 1 commit

[feat] ShardedDataParallel with autoreduce (#157) · ad933b34

Benjamin Lefaudeux authored Nov 21, 2020

* rewrite using autograd and Variable execution queue to make the reduce automatic
* share buckets with OSS to remove duplication
* some speed still likely on the table since the speed vs. bucketing does not match expectations, could be a follow up

ad933b34

19 Nov, 2020 2 commits
- [fix] Reverting a change which slipped in #188 (#198) · ba367d39
  Benjamin Lefaudeux authored Nov 18, 2020
```
* reverting a change which slipped in #188
```
  ba367d39
- [feat] Add CPU support for pipe.py benchmarks (#188) · a842a927
  Yuanyuan (Ana) Shen authored Nov 18, 2020
```
* Add CPU support for pipe.py benchmarks, CUDA-free
```
  a842a927
18 Nov, 2020 1 commit

[feat] ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182) · d85acf72

Benjamin Lefaudeux authored Nov 17, 2020

* adding a shard-aware GradScaler wrap, credits to Sean Naren for the idea
* adding stubs & explanations in the documentation

d85acf72
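
The commit message does not describe the mechanism, so the following is only a sketch of one plausible reason a grad scaler needs to be shard-aware: when each rank only inspects the gradients of its own shard, the "found inf/NaN" decision has to be shared so that every rank skips (or takes) the optimizer step together. The function names are hypothetical, and the list-based `all_reduce_max` is a stand-in for a real collective operation.

```python
import math


def local_found_inf(grads):
    """True if this rank's shard contains a non-finite gradient."""
    return any(math.isinf(g) or math.isnan(g) for g in grads)


def all_reduce_max(flags):
    """Stand-in for an all-reduce with MAX: every rank receives the same result."""
    combined = max(flags)
    return [combined] * len(flags)


# Hypothetical gradients held by three ranks; only rank 1 sees an overflow.
shard_grads = [[0.1, -0.2], [float("inf"), 0.3], [0.0, 0.5]]

local_flags = [local_found_inf(g) for g in shard_grads]
global_flags = all_reduce_max(local_flags)

# Without the reduction, ranks 0 and 2 would step while rank 1 skips.
print(local_flags, global_flags)
```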

16 Nov, 2020 1 commit
- [feat] OSS-aware clip grads, bridge sharded states (#167) · ade312c4
  Benjamin Lefaudeux authored Nov 16, 2020
```
add a clip gradients util, equivalent to torch's but aware of the sharded states. Add a corresponding unit test
```
  ade312c4
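
A minimal sketch of what "aware of the sharded states" can mean for gradient clipping, assuming each rank only holds the gradients of its own shard: the global L2 norm is the square root of the sum of the per-shard squared norms, and every shard is then scaled by the same coefficient. The names below are hypothetical, plain Python lists stand in for tensors, and the sum over shards stands in for an all-reduce.

```python
import math


def local_squared_norm(grads):
    """Squared L2 norm of the gradients held by one rank."""
    return sum(g * g for g in grads)


def clip_sharded_grads(shards, max_norm, eps=1e-6):
    """Clip all shards against the global norm; returns (total_norm, clipped_shards)."""
    # Stand-in for an all-reduce (SUM) of the per-rank squared norms.
    total_sq = sum(local_squared_norm(s) for s in shards)
    total_norm = math.sqrt(total_sq)
    # Same scaling rule as the usual global-norm clipping, never scaling up.
    coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * coef for g in s] for s in shards]
    return total_norm, clipped


# Two ranks, each holding one gradient value: the global norm is 5, not 3 or 4.
total_norm, clipped = clip_sharded_grads([[3.0], [4.0]], max_norm=1.0)
clipped_norm = math.sqrt(sum(local_squared_norm(s) for s in clipped))
print(total_norm, clipped_norm)
```

Clipping each shard against its own local norm would instead rescale the two ranks by different factors, which changes the direction of the overall gradient.
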
12 Nov, 2020 1 commit
- [fix] Pure cpu support for benchmarks/oss.py (#185) · 2fe93203
  Yuanyuan (Ana) Shen authored Nov 12, 2020
```
* now works on a machine without CUDA, which makes debugging and quick tests easier
```
  2fe93203
10 Nov, 2020 1 commit

Single-process control via PipeRPCWrapper (#156) · 5d4f50fb

Tom Birch authored Nov 10, 2020

Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive

Also added examples for multi-process and PipeRPCWrapper

5d4f50fb

06 Nov, 2020 1 commit
- [feature] Add a torch AMP benchmark option and test job (#175) · cc766aa5
  Benjamin Lefaudeux authored Nov 05, 2020
```
* oss benchmark: add an --amp option
* add a circleCI test
```
  cc766aa5
28 Oct, 2020 1 commit
- [chore] update isort to 5.6.4 (#170) · ea9876e3
  msbaines authored Oct 27, 2020
  
  ea9876e3
23 Oct, 2020 1 commit
- [feat][minor] OSS Benchmark - add a debug option to add some tensor dumps (#166) · 34f35fba
  Benjamin Lefaudeux authored Oct 23, 2020
```
* Some ease of use in the benchmark tool, add a debug option
```
  34f35fba
21 Oct, 2020 1 commit

[feature] OSS: Use MNIST to benchmark (#159) · 6f8a8652

Benjamin Lefaudeux authored Oct 21, 2020

* switching to MNIST
* updating the reference values, should be good to go
* download dataset once for all processes

6f8a8652

20 Oct, 2020 1 commit
- [feat][minor] OSS benchmark - pick the model via args (#152) · 49a3d9bc
  Benjamin Lefaudeux authored Oct 20, 2020
```
* Minor quality-of-life change: easier to debug, and makes it possible to test a host of models with the same code
```
  49a3d9bc
17 Oct, 2020 1 commit
- [feat][minor] OSS: benchmark - adding a cpu option (#144) · 10062e58
  Benjamin Lefaudeux authored Oct 16, 2020
```
* adding a cpu option
* adjust the reference loss
```
  10062e58
14 Oct, 2020 1 commit
- [feat] OSS: adding a --profile option to the benchmark (#135) · 34915bf8
  Benjamin Lefaudeux authored Oct 14, 2020
  
  34915bf8
10 Oct, 2020 1 commit
- [bugfix] OSS no reduce loss (#133) · 177151e0
  Benjamin Lefaudeux authored Oct 09, 2020
```
* bugfix
* adjust default non-regression loss, not all_reduced now
```
  177151e0
09 Oct, 2020 1 commit
- [minor] OSS: bring DDP in the benchmark (#130) · bfd88cad
  Benjamin Lefaudeux authored Oct 08, 2020
```
More realistic benchmarks, comparing apples to apples. DDP/OSS+DDP/OSS+SDP
```
  bfd88cad
06 Oct, 2020 1 commit

[feat] OSS/SDP : bucketing (#122) · 341d8b2b

Benjamin Lefaudeux authored Oct 05, 2020

Same bucketing strategy for OSS and SDP:
Sort everything ahead of time, per rank and per size, smallest tensors first. Pack the smallest tensors into a fixed-size buffer and send it asynchronously, then send all the other tensors asynchronously, and get back to the bucket. Once that is done, scatter the bucket contents if needed.

341d8b2b
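
The planning step of the bucketing strategy described above can be sketched as follows, for the tensors destined to a single rank. This is an illustration under stated assumptions, not the fairscale implementation: `split_for_bucket`, the tensor names, and the capacity are hypothetical, and sizes are plain element counts rather than real tensors.

```python
def split_for_bucket(tensor_sizes, bucket_capacity):
    """Split tensors into (bucketed, direct, used) given a fixed bucket capacity.

    tensor_sizes maps a tensor name to its number of elements.
    """
    # Sort ahead of time, smallest tensors first.
    ordered = sorted(tensor_sizes.items(), key=lambda item: item[1])
    bucketed, direct, used = [], [], 0
    for name, size in ordered:
        if used + size <= bucket_capacity:
            # Small tensors are packed into the shared fixed-size buffer.
            bucketed.append(name)
            used += size
        else:
            # Everything that does not fit is sent on its own.
            direct.append(name)
    return bucketed, direct, used


sizes = {"bias_a": 10, "weight_a": 400, "bias_b": 20, "weight_b": 1000, "norm": 30}
bucketed, direct, used = split_for_bucket(sizes, bucket_capacity=100)
print(bucketed, direct, used)
```

The point of the split is that many small messages are replaced by one buffer-sized message, while large tensors, which already amortize the communication latency, are sent individually.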

29 Sep, 2020 1 commit
- [ShardedDDP] Sync buffers + small cleanup (#112) · 79ded821
  Benjamin Lefaudeux authored Sep 28, 2020
```
- adding the buffer broadcast option
- minor cleanup in shardedDDP
```
  79ded821
24 Sep, 2020 1 commit

[fix] OSS benchmark cleanup (#109) · 53553474

Benjamin Lefaudeux authored Sep 24, 2020

- small benchmark refactor, only one for all backends and ddp
- deterministic, enforce alignment with pytorch ddp

53553474

22 Sep, 2020 2 commits
- [bug] Make OSS Gloo-compliant (#102) · b488dcfa
  Benjamin Lefaudeux authored Sep 22, 2020
```
* Broadcasting grad-enabled tensors is forbidden in Gloo, because this is not differentiable. Workaround
```
  b488dcfa
- [chore] OSS doc (#101) · d80c38f9
  Benjamin Lefaudeux authored Sep 22, 2020
```
* Doc extensions to some APIs
* Fix the benchmark and tutorial
```
  d80c38f9
17 Sep, 2020 2 commits

Multi-process pipe (#90) · 63f7796a

Tom Birch authored Sep 17, 2020

Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
* Added support for lazy construction of modules (see lazy_construction for an example)
* Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
* Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess

63f7796a

[feat] Sharded DDP - small refactor and new features (#97) · 49a198c9

Benjamin Lefaudeux authored Sep 17, 2020

- rename oss_ddp to ShardedDataParallel
- some refactoring
- ShardedDataParallel owns the sharded optimizer, exposed if need be
- some small perf bumps

49a198c9

16 Sep, 2020 1 commit
- [cleanup] fix pre-commit mypy issues (#87) · 4a874a6b
  msbaines authored Sep 16, 2020
  
  4a874a6b
09 Sep, 2020 1 commit

[feat] OSS flatten state dict (#65) · 4f597233

Benjamin Lefaudeux authored Sep 09, 2020

Changes the structure of the returned state dict with respect to the param_groups to make it closer to what a vanilla optimizer would return (un-shard them). Shard again when loading

4f597233

03 Sep, 2020 3 commits

[feat] Add a memory usage regression test to the OSS benchmark (#62) · ee38e1e0

Benjamin Lefaudeux authored Sep 03, 2020

* Aligning the optimizer state dict with what PyTorch expects

* Adding a check on the dict keys, ensure that `state` and `param_groups` are there

* after installing the specific isort, black and all, one liner to please the linter..

* Adding some measurement of the memory consumption while training + checkpointing

* mandatory lintfix commit

* brainfart, reset the memory use counter at the beginning of the training in case two of them are run in a row

* move reset stats call, hotfix

* move the optimizer to rmsprop, more stateful and still used in CV

* trying to figure out a SIGSEGV in circleci

ee38e1e0

Add grad scaler (#48) · b6a5e634

Jun Ru Anderson authored Sep 03, 2020



Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require gradient scaling in order to converge.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

b6a5e634

[fix] OSS pytorch-compliant state dict (#61) · 1d1d15ea

Benjamin Lefaudeux authored Sep 03, 2020

* Aligning the optimizer state dict with what PyTorch expects

* Adding a check on the dict keys, ensure that `state` and `param_groups` are there

* after installing the specific isort, black and all, one liner to please the linter..

1d1d15ea

28 Aug, 2020 1 commit

[test] specify chunks for pipe/transformer benchmark (#52) · d1d74413

Jun Ru Anderson authored Aug 28, 2020



* specify chunks for pipe/transformer benchmark

Set chunks to be equal to len(balance) for pipe/transformer benchmark. Will update words per second and memory usage checks in next commit (must test on CircleCI to find appropriate values)

* change benchmark words per second and memory usage

Did six runs for words-per-second, with results: 9144.40, 9163.91, 9993.01, 9082.82, 9155.09, 9000.67
Peak allocated bytes per device (which does not change between runs) were 193206272, 645632, 562688, 92688384 for devices 0, 1, 2 and 3, respectively

* increase batch size

batch size was small enough that the GPU's computing power was not the bottleneck, slowing training and specifically making more chunks slower. Increasing batch size has therefore increased training speed

* update benchmark numbers

ran six times, with wps 36917.44, 36797.65, 37006.03, 36872.84, 37129.31, 37003.31 and peak allocated bytes 4061909504, 4050944, 10427392, 2031824896 for devices 0,1,2 and 3 respectively.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

d1d74413

22 Aug, 2020 1 commit

[feat] optimizer state scaling (#44) · 5251a69a

Jun Ru Anderson authored Aug 21, 2020



Implement scaling of optimizer state when using pure-fp16 training to avoid underflow. Update benchmark to use pure-fp16. Modify state_dict methods to store and load the optimizer state scale.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

5251a69a
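
To illustrate the underflow that optimizer state scaling is meant to avoid, here is a small sketch that rounds values through IEEE half precision using the standard library's `struct` format `"e"`. The scale factor of `2**16` and the example value are assumptions made for the illustration; the commit message does not say how the scale is chosen.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest half-precision value and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


# A squared gradient of 1e-4, as Adam's second-moment estimate might hold.
state_value = 1e-8

# Stored directly in fp16, the value is below the smallest subnormal and becomes 0.
unscaled = to_fp16(state_value)

# Stored with a scale factor, it lands in the representable range instead.
scale = 2.0 ** 16  # hypothetical scale
stored = to_fp16(state_value * scale)
recovered = stored / scale

print(unscaled, recovered)
```

This is also why the scale has to be saved and restored together with the state, as the commit does in the `state_dict` methods: the stored values are only meaningful relative to it.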

21 Aug, 2020 2 commits

[feat] Simple macro OSS benchmark (#47) · 46c3776b

Benjamin Lefaudeux authored Aug 21, 2020



* initial commit, dummy training loop, pure pytorch but not DDP

* probably slightly broken, but rough DDP benchmark run

* adding the torchvision requirement for testing

* brainfart

* reduce the loss, do something slightly distributed

* Some cleanup, distributing the training on two GPUs

* some cleanup + adding a vanilla run, still not good to go

* less silly defaults, gtg for a start I think

* smaller batch to fit the smaller gpus used in the circleci rigs

* Adding some options for the benchmark, and regression testing

* [test] set torch seed for Adam tests (#49)

Set the torch seed for tests. xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

* linting, I really need to automate this isort insanity
Co-authored-by: Jun Ru Anderson <33384298+andersonic@users.noreply.github.com>
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

46c3776b

[test] set torch seed for Adam tests (#49) · 0e8c2a96

Jun Ru Anderson authored Aug 21, 2020



Set the torch seed for tests. xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

0e8c2a96

18 Aug, 2020 1 commit

[feat] allow fp16 optimizer state with Adam (#41) · 8ee5a8ff

Jun Ru Anderson authored Aug 18, 2020



Allow training with optimizer state in fp16. Use an enum to select from full-precision, mixed precision, memory efficient mixed precision and pure fp16. Improve clarity of testing code
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

8ee5a8ff

14 Aug, 2020 1 commit

[feat] add mixed precision Adam (#40) · e2d8f573

Jun Ru Anderson authored Aug 14, 2020



Add support for mixed-precision (half precision params, full precision gradients) and memory-efficient (half precision params and half precision gradients) training with Adam
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

e2d8f573

31 Jul, 2020 3 commits
- [feat] add FusedAdam (#10) · bfba68d8
  Jun Ru Anderson authored Jul 30, 2020
```
Add FusedAdam, update benchmark and add tests.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
```
  bfba68d8
- [test] switch benchmark to use Adam rather than SGD · e4b2ffd4
  Jun Ru Anderson authored Jul 17, 2020
  
  e4b2ffd4
- [feat] add Transformer gpipe benchmark · 74181b08
  Jun Ru Anderson authored Jul 17, 2020
  
  74181b08
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  
  0cd65242