- 03 Feb, 2021 (1 commit)

  anj-s authored:
    * mp cleanup
    * round of multiprocess refactoring
    * test golden run
    * print cuda stats
    * fix lint errors
    * enable multiprocess pipe benchmarks
    * set world size to be available gpus
    * more changes
    * use synthetic loaders for intermediate pipeline stages
    * merged master
    * fix for the devices property
    * dataloader fix
    * modify rank check
    * print wps stats
    * enable verification
    * fix logging
    * fix flag name
    * fix flag name
    * check for rank
    * fix indent
    * pass args
    * pass args
    * modify golden data
    * remove unused print message
    * fix lint errors
    * add comments
    * fix benchmarks

    Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
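The "synthetic loaders for intermediate pipeline stages" item reflects a common pattern in multi-process pipe benchmarks: only the first stage consumes real data, so later ranks can iterate over dummy tensors of the right shape. A minimal sketch of the idea (the batch size and shapes here are illustrative assumptions, not the benchmark's actual values):

```python
import torch

def make_synthetic_loader(batch_size, sample_shape, num_batches):
    """Yield dummy batches so non-input pipeline stages can drive their
    training loop without touching the real dataset."""
    for _ in range(num_batches):
        yield torch.zeros(batch_size, *sample_shape)

# Hypothetical usage: rank 0 reads the real data, later stages use synthetic batches.
# loader = real_loader if rank == 0 else make_synthetic_loader(32, (3, 224, 224), len(real_loader))
```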

- 29 Jan, 2021 (1 commit)

  Min Xu authored:
    * [test]: test with py39 + torch 1.8 nightly
    * version fix
    * more fixes
    * fix version function for nightly versions
    * fix torch_pg build
    * invalidate cache
    * separate benchmark requirements
    * comment
    * fixed mypy
    * fixed a test

- 27 Jan, 2021 (1 commit)

  msbaines authored:
    We can save time by only running unit tests once instead of twice (with and without coverage).

- 25 Jan, 2021 (1 commit)

  Min Xu authored:
    * [test] cover python 3.7 to 3.9 on CPU
      - covering common python versions on CPU tests
      - added doc build test
    * add doc build test
    * skipping failing tests on py39
    * catching doc build warnings
    * add doc build to py38 and py39
    * minor fix
    * fix doc build for adascale
    * removed dead code
    * fix the skipping
    * skip unit test for py39
    * add failing example
    * no more py39 skipping the tests

- 16 Jan, 2021 (1 commit)

  msbaines authored

- 15 Jan, 2021 (1 commit)

  msbaines authored

- 11 Jan, 2021 (1 commit)

  Benjamin Lefaudeux authored:
    * tentatively fixing the cpu version of the circleci jobs; the pipe tests are now the last ones standing
    * fixing oss backcompat, and trying to fix rpc in old pytorch as well
    * fixing the file-based init in torch 1.5

- 05 Jan, 2021 (1 commit)

  Benjamin Lefaudeux authored:
    * adding the pytest timeout plugin to properly root out hanging tests
    * removing redundant code, slightly more reasonable timeout, works on single cuda
    * finding the root bug for some of the cpu hangs: rpc init
    * propagating all the rpc init test changes to the pipe and model parallel tests
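For context, the pytest-timeout plugin referred to here fails a test instead of letting it hang forever. A minimal sketch of how it is typically wired up (the timeout budgets are illustrative choices, not the values used in this CI):

```python
# Requires: pip install pytest-timeout
# A global budget can be set on the command line:  pytest --timeout=60
import time

import pytest

@pytest.mark.timeout(5)  # per-test override: fail (rather than hang) after 5s
def test_does_not_hang():
    time.sleep(0.1)  # a real test would exercise rpc/pipe setup here
```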

- 30 Dec, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    - tighter regression detection, based on the best case vs. the worst case
    - still run all configurations, useful for comparisons but not a target

- 22 Dec, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * keep two torch 1.7 profiles to preserve cuda 10.1 testing

- 30 Nov, 2020 (1 commit)

  Benjamin Lefaudeux authored

- 22 Nov, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * testing median and MAD (median absolute deviation)
    * synchronize on kernels to make sure that we're measuring the actual completion time
    * adjusting the circleci threshold, not because the speed has regressed but because we now measure proper cuda execution time
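The synchronization point matters because CUDA kernels launch asynchronously: without `torch.cuda.synchronize()` a host-side timer only measures launch overhead, not completion time. A minimal sketch of the median/MAD measurement idea (the function under test and the iteration count are illustrative):

```python
import statistics
import time

import torch

def benchmark(step, iters=20):
    """Time a CUDA workload properly: synchronize before reading the clock,
    then summarize with median and MAD, which are robust to outlier runs."""
    times = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = time.monotonic()
        step()  # the workload being measured (assumed to launch CUDA kernels)
        torch.cuda.synchronize()  # wait for the kernels to actually finish
        times.append(time.monotonic() - start)
    med = statistics.median(times)
    mad = statistics.median(abs(t - med) for t in times)
    return med, mad
```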

- 21 Nov, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * rewrite using autograd and the Variable execution queue to make the reduce automatic
    * share buckets with OSS to remove duplication
    * some speed is likely still on the table, since the speed vs. bucketing behaviour does not match expectations; could be a follow-up
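Making "the reduce automatic" via autograd typically means hanging a hook off each parameter so the all-reduce fires as soon as that gradient is produced, rather than in an explicit loop after backward(). A simplified sketch of that pattern (not ShardedDataParallel's actual implementation; assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def attach_auto_reduce(model: torch.nn.Module):
    """Register per-parameter hooks so gradients are averaged across ranks
    as soon as autograd computes them, overlapping comms with backward."""
    world_size = dist.get_world_size()

    def make_hook():
        def hook(grad):
            grad = grad / world_size  # pre-divide so the sum is an average
            dist.all_reduce(grad)     # async_op=True could overlap this further
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
```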

- 20 Nov, 2020 (1 commit)

  msbaines authored

- 19 Nov, 2020 (1 commit)

  msbaines authored

- 06 Nov, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * oss benchmark: add an --amp option
    * add a circleCI test
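An --amp flag on a benchmark usually just toggles PyTorch's automatic mixed precision around the forward pass. A hedged sketch of the wiring (the flag name comes from the commit; the model and batch are illustrative):

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--amp", action="store_true",
                    help="run the benchmark under automatic mixed precision")
args = parser.parse_args()

model = torch.nn.Linear(1024, 1024).cuda()
batch = torch.randn(32, 1024, device="cuda")

# autocast(enabled=False) is a no-op, so one code path covers both modes
with torch.cuda.amp.autocast(enabled=args.amp):
    out = model(batch)
```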

- 30 Oct, 2020 (1 commit)

  msbaines authored

- 29 Oct, 2020 (1 commit)

  msbaines authored

- 28 Oct, 2020 (1 commit)

  msbaines authored

- 23 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * some ease-of-use improvements in the benchmark tool, and a debug option

- 22 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored

- 21 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * switching to MNIST
    * updating the reference values, should be good to go
    * download dataset once for all processes
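Downloading "once for all processes" is the usual rank-0-downloads-then-barrier dance, so concurrent workers do not race on the same files. A minimal sketch (the root path and dataset arguments are illustrative; assumes the process group is initialized):

```python
import torch.distributed as dist
from torchvision.datasets import MNIST

def get_mnist(root="/tmp/mnist"):
    """Let rank 0 download the data; everyone else waits, then reads from disk."""
    if dist.get_rank() == 0:
        MNIST(root, train=True, download=True)
    dist.barrier()  # all other ranks block until the files exist
    return MNIST(root, train=True, download=False)
```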

- 17 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * adding a cpu option
    * adjust the reference loss

- 16 Oct, 2020 (1 commit)

  msbaines authored

- 14 Oct, 2020 (1 commit)

  msbaines authored

- 10 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * bugfix
    * adjust the default non-regression loss, which is no longer all_reduced

- 09 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    More realistic benchmarks, comparing apples to apples: DDP vs. OSS+DDP vs. OSS+SDP.

- 08 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * new unit test to catch rank issues in OSS

- 01 Oct, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * minor, but gives some memory back
    * adjust CI and regression checks to 4 gpus

- 24 Sep, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    - small benchmark refactor: only one benchmark for all backends and ddp
    - deterministic, enforcing alignment with pytorch ddp

- 22 Sep, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * Broadcasting grad-enabled tensors is forbidden in Gloo, because the broadcast is not differentiable; this commit works around that.
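A common workaround for this restriction is to broadcast the underlying storage rather than the autograd-tracked tensor, e.g. via .data or under no_grad. A minimal sketch of that idea (not necessarily the exact fix in this commit; assumes an initialized Gloo process group):

```python
import torch
import torch.distributed as dist

def broadcast_params(model: torch.nn.Module, src: int = 0):
    """Sync parameters from `src` without tripping Gloo's check on
    grad-enabled tensors: broadcast the detached storage instead."""
    for p in model.parameters():
        dist.broadcast(p.data, src=src)  # p.data is not tracked by autograd
```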

- 17 Sep, 2020 (2 commits)

  Tom Birch authored:
    Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines):
    * adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess but also supporting PipelineStyle.MultiProcess
    * added support for lazy construction of modules (see lazy_construction for an example)
    * added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
    * copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
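A speculative sketch of how the new constructor might be invoked, based purely on the names in this message; the import path for PipelineStyle, the balance/chunks keywords, and the multi-process setup requirements are assumptions that may differ from the real API:

```python
import torch.nn as nn
from fairscale.nn import Pipe
# import path assumed; PipelineStyle may live elsewhere under fairscale.nn.pipe
from fairscale.nn.pipe import PipelineStyle

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# default: all stages in one process, split across local devices
pipe = Pipe(model, balance=[2, 1], chunks=4)

# multi-process: each rank hosts its slice of the model; the rpc or
# send/recv worker setup that this requires is omitted here
pipe_mp = Pipe(model, balance=[2, 1], chunks=4, style=PipelineStyle.MultiProcess)
```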

  Benjamin Lefaudeux authored:
    - rename oss_ddp to ShardedDataParallel
    - some refactoring
    - ShardedDataParallel owns the sharded optimizer, exposed if need be
    - some small perf bumps

- 03 Sep, 2020 (1 commit)

  Jun Ru Anderson authored:
    Add GradScaler to fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require it in order to converge.

    Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
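For reference, the canonical PyTorch GradScaler training step looks like the following; a subclass is driven the same way. The model, optimizer, and loss here are placeholders, not the pipe benchmark's actual setup:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # fairscale's GradScaler subclasses this

for _ in range(10):
    optimizer.zero_grad()
    batch = torch.randn(32, 1024, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(batch).sum()
    scaler.scale(loss).backward()  # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # adjusts the scale factor for the next iteration
```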

- 21 Aug, 2020 (1 commit)

  Benjamin Lefaudeux authored:
    * initial commit: dummy training loop, pure pytorch but not DDP
    * probably slightly broken, but a rough DDP benchmark run
    * adding the torchvision requirement for testing
    * brainfart
    * reduce the loss, do something slightly distributed
    * some cleanup, distributing the training on two GPUs
    * some cleanup + adding a vanilla run, still not good to go
    * less silly defaults, good to go for a start I think
    * smaller batch to fit the smaller gpus used in the circleci rigs
    * adding some options for the benchmark, and regression testing
    * [test] set torch seed for Adam tests (#49): set the torch seed for tests; xfail the mixed precision and memory-efficient mixed-precision state_dict tests, due to their states being cast to FP16 and back to FP32 during load_state_dict
    * linting, I really need to automate this isort insanity

    Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
    Co-authored-by: Jun Ru Anderson <33384298+andersonic@users.noreply.github.com>

- 14 Aug, 2020 (1 commit)

  msbaines authored

- 13 Aug, 2020 (2 commits)

- 31 Jul, 2020 (2 commits)

  Jun Ru Anderson authored:
    Add FusedAdam, update the benchmark, and add tests.

    Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

  msbaines authored