Commits · fc1a40e16695f29e694a8345e1d13b60387b335c · OpenDAS / fairscale

30 Dec, 2020 4 commits

[fix] regression testing oss+sharded_ddp only (#281) · fc1a40e1

Benjamin Lefaudeux authored Dec 29, 2020

- tighter regression detection, based on the best case vs. worst case
- still run all configurations, useful for comparisons but not a target

fc1a40e1

[refactor] Remove unused variables, add configuration objects and basic... · 3c727ec5

anj-s authored Dec 29, 2020


[refactor] Remove unused variables, add configuration objects and basic cleanup for pipe benchmarks. (#252)

* [refactor] Remove unused variables and refactor common configurations

* move helper function to call site

* fixed lint errors

* fix lint errors

* fix lint errors

* fix lint errors

* fix import order

* format files

* remove unused imports

* fix lint errors

* address PR comments

* sorted imports

* add space

* modify comment

* added doc strings and addressed PR comments.

* addressed PR comments

* added another comment to clarify.

* fixing lint errors

* rename variable

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

3c727ec5

[fix] Hopeful CircleCI hangfix - teardown if raising exception (#280) · 8321f682
Benjamin Lefaudeux authored Dec 29, 2020

* timeout on the process join, expose a hanging process
* make sure that teardown is always called

8321f682
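
The two bullets above (a bounded wait that exposes a hang, and a teardown that always runs) can be illustrated with a minimal sketch. It uses the standard library's `subprocess` module rather than the project's actual test harness, and `run_with_teardown` and its parameters are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def run_with_teardown(cmd, timeout_s, teardown):
    """Run `cmd`, wait at most `timeout_s` seconds, and always call `teardown`.

    Returns the exit code, or None if the process had to be killed.
    (Hypothetical helper, not part of fairscale.)
    """
    proc = subprocess.Popen(cmd)
    try:
        # A bounded wait turns a silent hang into an explicit timeout.
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # expose the hang instead of blocking the CI job
        proc.wait()  # reap the killed process
        return None
    finally:
        teardown()  # runs on success, on timeout, and on any other exception


calls = []
ok = run_with_teardown(
    [sys.executable, "-c", "pass"], 30.0, lambda: calls.append("ok")
)
hung = run_with_teardown(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    1.0,
    lambda: calls.append("hung"),
)
```

Putting the teardown in `finally` is the design point: it runs whether the worker exits cleanly, times out, or raises.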

[fix] Dead code removal for OSS (#276) · fb8d9137

Benjamin Lefaudeux authored Dec 29, 2020

* removing a dead call since ShardedDDP, small speedup
* unrelated, but filling in the changelog
* another nit

fb8d9137

29 Dec, 2020 2 commits
- [hotfix] Catching properly a given test failing if not enough gpus (#274) · 7abaa2be
  Benjamin Lefaudeux authored Dec 28, 2020
  * catching properly a given test failing if not enough gpus
  7abaa2be
- [feature] OSS: add unit test for distributed checkpointing (#273) · 60c8de4a
  Joshua Meier authored Dec 28, 2020
  author: Joshua Meier
  60c8de4a
28 Dec, 2020 2 commits
- [chore] Move all unit tests dist init to being file based (#272) · b640cab5
  Benjamin Lefaudeux authored Dec 28, 2020
  * file based dist init
  * nicer handling of world sizes that do not match the number of available GPUs: warn instead of breaking
  b640cab5
- [doc] better ShardedGradScaler example (#271) · 290afecd
  Benjamin Lefaudeux authored Dec 27, 2020
  
  290afecd
24 Dec, 2020 1 commit

[chore] Update changelog (#268) · 18455bf0

Min Xu authored Dec 23, 2020

* Update changelog

missed this item from previous AdaScale commit.

* More change log

* Addressed review comments

18455bf0

22 Dec, 2020 2 commits

[fix] CircleCI vs pip hotfix (#267) · 381d28ca
Benjamin Lefaudeux authored Dec 22, 2020

* keep two torch 1.7 profiles to save cuda 10.1 testing

381d28ca

[OSS] Balance the trainable params only (#262) · c386e937

Benjamin Lefaudeux authored Dec 21, 2020

* fix, one liner

* adjust so that frozen trunks still get spread, even though this should have little consequence

* removing dead code, hopeful unit test fix

* now with some linting..

* adding a proper unit test case

c386e937
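
The balancing idea named in this commit can be sketched as a greedy partition in which only trainable parameters contribute their full size to a rank's load, while frozen ones are still spread across ranks. This is a plain-Python illustration, not the OSS implementation: `Param`, `partition_params`, and the choice to count a frozen parameter as a load of 1 are assumptions made for the sketch.

```python
from typing import List, Tuple

# (name, number of elements, requires_grad): a stand-in for a real parameter.
Param = Tuple[str, int, bool]


def partition_params(params: List[Param], world_size: int) -> List[List[str]]:
    """Greedily assign each parameter to the currently least-loaded rank.

    Hypothetical helper: trainable parameters add their element count to the
    load; frozen ones add 1 so that they are spread without skewing the balance.
    """
    shards: List[List[str]] = [[] for _ in range(world_size)]
    load = [0] * world_size
    for name, numel, requires_grad in params:
        rank = load.index(min(load))  # first rank with the smallest load
        shards[rank].append(name)
        load[rank] += numel if requires_grad else 1
    return shards


params = [
    ("trunk.0", 1000, False),  # frozen
    ("trunk.1", 1000, False),  # frozen
    ("head.w", 300, True),
    ("head.b", 10, True),
    ("head2.w", 200, True),
]
shards = partition_params(params, world_size=2)
# shards == [["trunk.0", "head.w"], ["trunk.1", "head.b", "head2.w"]]
```

In this example the two frozen trunk tensors land on different ranks, and the trainable sizes (300 vs. 210) end up roughly balanced.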

19 Dec, 2020 1 commit

[OSS] Getting rid of the "should bucket" hash table, just use a list + non... · ca74ee22

Benjamin Lefaudeux authored Dec 19, 2020

[OSS] Getting rid of the "should bucket" hash table, just use a list + non trainable params fix (#259)

* Getting rid of the "should bucket" hash table, just use a list
Properly handle all params, with or without requires_grad

* make sure that this case is unit tested

ca74ee22

17 Dec, 2020 3 commits
- [fix] grad scaler optional process group (#257) · bd7e25a5
  Benjamin Lefaudeux authored Dec 17, 2020
  
  bd7e25a5
- [fix] OSS - resolve fp16 overflow in clip grad norm (#263) · 2df5ca2d
  Joshua Meier authored Dec 17, 2020
  
  2df5ca2d
- [fix] OSS - typo + small perf fix (#256) · 2d9243bf
  Benjamin Lefaudeux authored Dec 16, 2020
  * typo, sorry about that
  * small perf fix
  2d9243bf
16 Dec, 2020 6 commits

[perf] ShardedDDP: better handling of the callback queue, try to consume it as we go. (#254) · 351f35e1
Benjamin Lefaudeux authored Dec 16, 2020

* Better handling of the callback queue, try to consume it as we go.

* dumping buckets for the reduce part, always the same unused params issue

351f35e1

[docs] lintfixes (#255) · 19cb5938

Benjamin Lefaudeux authored Dec 16, 2020



* lintfixes

* come on black

* Update tutorial_pipe_multiprocess.py

make RANK global like the other tutorials

Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>

19cb5938

[doc] Update README.md (#244) · 550f1ab7

VitaliyLi authored Dec 16, 2020



* Update README.md

* Update README.md

update capitalization

Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>

550f1ab7

[feat] add CPU support to tutorials in examples + factorize tutorials (#247) · 02478eb3

jessijzhao authored Dec 15, 2020

* [feat] add CPU support to tutorials in examples

* now works on a machine without cuda
* fixes some minor typos

* [cleanup] factorize tutorials in examples

* collects duplicate code across tutorials in helpers.py

* [fix] getData in tutorials now returns iterable

02478eb3

[fix] solutions to recent pip's isolation failing to build from source (#249) · 7e5ddcd2
Stas Bekman authored Dec 15, 2020

7e5ddcd2

[feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838

Min Xu authored Dec 15, 2020

* [doc]: AdaScale example and notes

* formatted notes correctly as suggested by Benjamin

* added feature and unit test to make sure lr_scheduler works

* update the example with lr_scheduler

* fixed doc with "make html"

* addressed Mike's suggestions

d65cd838

15 Dec, 2020 1 commit
- [cleanup] ShardedDDP - inline gatekeeper (#248) · 4402c410
  Benjamin Lefaudeux authored Dec 15, 2020
  
  4402c410
14 Dec, 2020 1 commit

[fix] more adascale gradient accumulation tests and smoothing factor fix (#235) · f74afebb

Min Xu authored Dec 14, 2020

* better ddp adascale tests

* make sure the single node test uses the same test cases and expected gains

* added unit test that covers smoothing factor

- tested by re-introducing the bug and seeing the test fail as expected.

f74afebb

10 Dec, 2020 2 commits

[doc] updating the pipe balance doc a bit (#243) · 2eef71b9

Min Xu authored Dec 10, 2020

* [doc] updating the pipe balance doc a bit

- Also added a warning to pipeline.py when the partition output is not
supported.

* addressed Mandeep's comment

2eef71b9

[fix] Check ShardedDDP / DDP parity + bugfix (#242) · 138b2033

Benjamin Lefaudeux authored Dec 09, 2020

* unit test checking ddp and sharded_ddp equivalence, reproducing the issue that Sean spotted
* fixing the issue, not counting requests in flight properly
* adding a multiple optimizers case

138b2033

09 Dec, 2020 1 commit
- [fix] Renaming large logo file - free of spaces (#240) · 6afbe677
  Benjamin Lefaudeux authored Dec 09, 2020
  
  6afbe677
07 Dec, 2020 1 commit
- [fix] ShardedGradScaler - remove the strict optimizer type requirement (#237) · c6f40418
  Benjamin Lefaudeux authored Dec 07, 2020
  * removing strict typing requirement, broken by ClassyVision
  c6f40418
06 Dec, 2020 1 commit
- [fix] skipping NCCL tests on 2-GPU systems (#233) · bb468670
  Min Xu authored Dec 05, 2020
  
  bb468670
05 Dec, 2020 1 commit
- [doc] hotfixes, old documentation (#232) · 92210136
  Benjamin Lefaudeux authored Dec 04, 2020
  Thanks Jessica for the heads-up!
  92210136
04 Dec, 2020 2 commits

Logo (#227) · 47e57935

Vittorio Caggiano authored Dec 04, 2020



* add logo

* Update README.md

Co-authored-by: Vittorio Caggiano <caggiano@fb.com>

47e57935

[fix] Fix iGPT buckets with ShardedDDP (#223) · 6d223777

Benjamin Lefaudeux authored Dec 03, 2020

* proper unit testing, but no solution other than disabling bucketing for now; the couple of options tested do not work

6d223777

03 Dec, 2020 1 commit

[feat] AdaScale: Gradient Accumulation and Add PyTest unit tests (#202) · ce5860ea

Min Xu authored Dec 03, 2020

* added AdaScale to README

* [adascale] added gradient accumulation

- added gradient accumulation
- tested with full CIFAR training runs using different accumulation values
and verified that full accuracy is obtained
- also removed the patch optimize flag until we need it

* [adascale] adding pytest

- added basic and ddp tests and grad_accum
- closes #195

* added changelog

* added ddp grad_accum test

* moved ddp and non-ddp tests into separate files

* added checkpoint test

* more doc

* addressed Mike's comments

ce5860ea

02 Dec, 2020 1 commit
- [fix] make sure pip package includes header files (#221) · 867cc2df
  msbaines authored Dec 01, 2020
  Fixes #190
  867cc2df
01 Dec, 2020 4 commits
- [docs] Minor refactor, trying to improve a little bit the html (#220) · 8b5b9540
  Benjamin Lefaudeux authored Dec 01, 2020
  
  8b5b9540
- [chore] Refactor unit testing, shared utils (#218) · e83da060
  Benjamin Lefaudeux authored Dec 01, 2020
  
  e83da060
- [chore] create v0.1.0 (#219) · 1db8bbda
  msbaines authored Dec 01, 2020
  
  1db8bbda
- [fix][Pipe] fallback for Pipe tests on internal pytorch numbering (#216) · 4d8f2e59
  Benjamin Lefaudeux authored Nov 30, 2020
  * fallback on internal pytorch numbering
  4d8f2e59
30 Nov, 2020 1 commit
- [fix] OSS ad-hoc perf regression fix, more inconsistent than expected (#214) · 835ecb0c
  Benjamin Lefaudeux authored Nov 30, 2020
  
  835ecb0c
27 Nov, 2020 1 commit
- [doc] Fixing relative html links (#212) · d09f5aa2
  Benjamin Lefaudeux authored Nov 26, 2020
  Fixing the relative positions of the html docs.
  d09f5aa2
26 Nov, 2020 1 commit
- [fix] Adding a GradScaler import guard for amp with pytorch 1.5 (#210) · 8e85ce8c
  Benjamin Lefaudeux authored Nov 25, 2020
  
  8e85ce8c
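
An import guard of the kind this commit title describes might look like the following minimal sketch. It assumes that `torch.cuda.amp.GradScaler` cannot be imported on the older PyTorch release; the `AMP_AVAILABLE` flag and the `make_scaler` helper are hypothetical names, not taken from the fairscale source.

```python
try:
    # Not importable on PyTorch 1.5 (or when torch is missing altogether).
    from torch.cuda.amp import GradScaler

    AMP_AVAILABLE = True
except ImportError:
    GradScaler = None  # placeholder so the name always exists
    AMP_AVAILABLE = False


def make_scaler():
    """Return a GradScaler when AMP can be imported, otherwise None.

    Hypothetical helper for illustration only.
    """
    return GradScaler() if AMP_AVAILABLE else None
```

Callers can then branch on `AMP_AVAILABLE` (or on `make_scaler()` returning `None`) instead of failing at import time.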