- 08 Jan, 2021 3 commits
-
-
Benjamin Lefaudeux authored
* adding a parity unit test
* code review, better testing, use torch defaults and check for the loss, log world size
-
Benjamin Lefaudeux authored
-
Joshua Meier authored
* add additional unit test
* support model parallelism in oss
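For background on the OSS wrapper touched here: a minimal sketch of wrapping a stock optimizer with fairscale's OSS so its state is sharded across ranks. This shows plain OSS usage, not the model-parallel support added in this commit, and the single-rank gloo group is only there so the snippet runs stand-alone; constructor arguments reflect the fairscale API of this period and may have changed since.

```python
# Minimal OSS sketch: optimizer state is partitioned across ranks, each rank
# only keeps and updates its own shard.
import tempfile

import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS

# single-process "group of one" so the example is self-contained
dist.init_process_group(
    backend="gloo", init_method=f"file://{tempfile.mkstemp()[1]}", rank=0, world_size=1
)

model = torch.nn.Linear(8, 8)
# OSS takes the parameters, the class of the wrapped optimizer, and its kwargs
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1, momentum=0.9)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()  # each rank steps only the parameters it owns, then broadcasts them

dist.destroy_process_group()
```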
-
- 07 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* trying to fix the missing files in the pip package (not in this diff)
* adding a long description, more pypi friendly
-
- 05 Jan, 2021 2 commits
-
-
Benjamin Lefaudeux authored
* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests
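For context, the pytest-timeout plugin referenced above turns a hanging test into a failure after a fixed budget. A hedged sketch of how such a budget is usually declared; the test body is a placeholder, not a fairscale test.

```python
# Sketch: per-test timeout via the pytest-timeout plugin (pip install pytest-timeout).
# A global budget can also be set on the command line, e.g. `pytest --timeout=60`.
import time

import pytest


@pytest.mark.timeout(30)  # fail instead of hanging if the test exceeds 30 seconds
def test_does_not_hang():
    time.sleep(0.1)
    assert True
```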
-
Benjamin Lefaudeux authored
release pip package to follow suit
-
- 04 Jan, 2021 3 commits
-
-
anj-s authored
* [refactor] Remove unused variables and refactor common configurations
* move helper function to call site
* fixed lint errors
* fix lint errors
* fix lint errors
* fix lint errors
* fix import order
* format files
* remove unused imports
* fix lint errors
* fix lint errors
* refactor common utilities
* address PR comments
* sorted imports
* add space
* modify comment
* added doc strings and addressed PR comments.
* addressed PR comments
* added another comment to clarify.
* fixing lint errors
* addressed PR comments
* addressed PR comments
* fixed typos
* initialize var
* rename seq_pred to lm
* fix lint errors
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
Benjamin Lefaudeux authored
-
Min Xu authored
* [feat] sync adascale from internal repo - tbd testing: tbd
* Update argument document of __init__
* update documentation around set_num_gradients_to_accumulate
* added checking code for proper API calling places
* rename internal APIs to make them internal
* updated changelog
* added support for add_param_group and its unit test
* added unit test for set_num_gradients_to_accumulate
* added debias_ewma unit test
* fixed test_set_num_gradients_to_accumulate (need zero_grad() call)
* added missing zero_grad() to test_lr_scheduler
* fixed test_add_param_group with respect to optim.zero_grad()
* added test_gradient_value
* added test_scale_not_equal_default for scale != world_size
* grad_accum
* added test_unhook()
* removed print statements
* fixed a typo
* addressed Ben's comment
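The entry points named above (set_num_gradients_to_accumulate, add_param_group, the zero_grad ordering) suggest the following shape of use. A minimal sketch assuming the fairscale.optim.AdaScale wrapper around SGD; treat the exact signatures as assumptions taken from the commit messages, not as a reference.

```python
# Sketch: AdaScale wrapping SGD with gradient accumulation. Names follow the
# commit messages above; exact signatures are an assumption.
import tempfile

import torch
import torch.distributed as dist
from fairscale.optim import AdaScale

# single-rank group so the sketch runs stand-alone; real use is multi-rank
dist.init_process_group(
    backend="gloo", init_method=f"file://{tempfile.mkstemp()[1]}", rank=0, world_size=1
)

model = torch.nn.Linear(16, 2)
optim = AdaScale(torch.optim.SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=2)

for step, batch in enumerate(torch.randn(8, 4, 16)):
    loss = model(batch).sum()
    loss.backward()
    if (step + 1) % 2 == 0:  # step once every 2 accumulated backward passes
        optim.step()
        optim.zero_grad()

# the accumulation factor can be adjusted later, after grads have been zeroed
optim.set_num_gradients_to_accumulate(4)

dist.destroy_process_group()
```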
-
- 02 Jan, 2021 1 commit
-
-
Benjamin Lefaudeux authored
* fix typo, backend for CPU test
-
- 30 Dec, 2020 5 commits
-
-
Sean Naren authored
* Add function to add handle for sync BN
* Add test to ensure batch norm handles have been added
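The sync-BN handle itself lives inside fairscale's data-parallel wrapper; as general background only, converting a model's BatchNorm layers to synchronized ones uses the stock PyTorch API shown below. This is not the fairscale change from this commit.

```python
# Sketch: stock PyTorch sync batch norm conversion, shown only as context for
# the sync-BN handle mentioned above. Running the converted model requires an
# initialised process group.
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```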
-
Benjamin Lefaudeux authored
- tighter regression detection, based on the best case vs. worst case
- still run all configurations, useful for comparisons but not a target
-
anj-s authored
[refactor] Remove unused variables, add configuration objects and basic cleanup for pipe benchmarks. (#252)
* [refactor] Remove unused variables and refactor common configurations
* move helper function to call site
* fixed lint errors
* fix lint errors
* fix lint errors
* fix lint errors
* fix import order
* format files
* remove unused imports
* fix lint errors
* address PR comments
* sorted imports
* add space
* modify comment
* added doc strings and addressed PR comments.
* addressed PR comments
* added another comment to clarify.
* fixing lint errors
* rename variable
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
Benjamin Lefaudeux authored
* timeout on the process join, expose a hanging process
* make sure that teardown is always called
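The join-with-timeout pattern this fix describes is framework-agnostic; a small sketch with the standard multiprocessing module, where the worker body is a stand-in rather than the fairscale test code.

```python
# Sketch: bound the join so a hung worker is surfaced instead of blocking the
# test run, and make sure teardown happens even when the join times out.
import multiprocessing as mp


def _worker() -> None:
    pass  # placeholder for the actual distributed test body


if __name__ == "__main__":
    p = mp.Process(target=_worker)
    p.start()
    p.join(timeout=60)  # do not wait forever on a hung worker
    try:
        if p.is_alive():
            raise RuntimeError("worker process hung")
    finally:
        if p.is_alive():
            p.terminate()  # teardown always runs
        p.join()
```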
-
Benjamin Lefaudeux authored
* removing a dead call since ShardedDDP, small speedup
* unrelated, but filling in the changelog
* another nit
-
- 29 Dec, 2020 2 commits
-
-
Benjamin Lefaudeux authored
* properly catching a given test failing when there are not enough GPUs
-
Joshua Meier authored
author: Joshua Meier
-
- 28 Dec, 2020 2 commits
-
-
Benjamin Lefaudeux authored
* file based dist init
* nicer handling of broken world sizes vs. number of available GPUs: do not break, but warn
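A sketch of the file-based initialisation mentioned above, using the standard torch.distributed API; the backend and temporary file are illustrative choices.

```python
# Sketch: file-based rendezvous for torch.distributed; every rank must point at
# the same shared file. Shown as a single-rank group so it runs stand-alone.
import tempfile

import torch.distributed as dist

sync_file = tempfile.NamedTemporaryFile(delete=False).name

dist.init_process_group(
    backend="gloo",                     # CPU-friendly; use "nccl" for GPU runs
    init_method=f"file://{sync_file}",  # all ranks rendezvous on this file
    rank=0,
    world_size=1,
)

assert dist.get_world_size() == 1
dist.destroy_process_group()
```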
-
Benjamin Lefaudeux authored
-
- 24 Dec, 2020 1 commit
-
-
Min Xu authored
* Update changelog: missed this item from the previous AdaScale commit.
* More changelog
* Addressed review comments
-
- 22 Dec, 2020 2 commits
-
-
Benjamin Lefaudeux authored
* keep two torch 1.7 profiles to preserve cuda 10.1 testing
-
Benjamin Lefaudeux authored
* fix, one liner
* adjust so that frozen trunks still get spread, even if this should have little consequence
* removing dead code, hopeful unit test fix
* now with some linting..
* adding a proper unit test case
-
- 19 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
[OSS] Getting rid of the "should bucket" hash table, just use a list + non-trainable params fix (#259)
* Getting rid of the "should bucket" hash table, just use a list. Properly handle all params, with or without requires_grad
* make sure that this case is unit tested
-
- 17 Dec, 2020 3 commits
-
-
Benjamin Lefaudeux authored
-
Joshua Meier authored
-
Benjamin Lefaudeux authored
* typo, sorry about that
* small perf fix
-
- 16 Dec, 2020 6 commits
-
-
Benjamin Lefaudeux authored
* Better handling of the callback queue, try to consume it as we go.
* dumping buckets for the reduce part, always the same unused params issue
-
Benjamin Lefaudeux authored
* lint fixes
* come on black
* Update tutorial_pipe_multiprocess.py: make RANK global like the other tutorials
Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>
-
VitaliyLi authored
* Update README.md
* Update README.md: update capitalization
Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>
-
jessijzhao authored
* [feat] add CPU support to tutorials in examples
* now works on a machine without cuda
* fixes some minor typos
* [cleanup] factorize tutorials in examples
* collects duplicate code across tutorials in helpers.py
* [fix] getData in tutorials now returns iterable
-
Stas Bekman authored
-
Min Xu authored
* [doc]: AdaScale example and notes
* formatted notes correctly as suggested by Benjamin
* added feature and unit test to make sure lr_scheduler works
* update the example with lr_scheduler
* fixed doc with "make html"
* addressed Mike's suggestions
-
- 15 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
-
- 14 Dec, 2020 1 commit
-
-
Min Xu authored
* better ddp adascale tests
* make sure the single node test uses the same test cases and expected gains
* added unit test that covers the smoothing factor - tested by re-introducing the bug and seeing the test fail as expected.
-
- 10 Dec, 2020 2 commits
-
-
Min Xu authored
* [doc] updating the pipe balance doc a bit
  - Also added a warning to pipeline.py when the partition output is not supported.
* addressed Mandeep's comment
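For context on the balance documentation: a minimal sketch of passing an explicit layer balance to fairscale's Pipe wrapper. The layer counts and chunk size are made up, and the constructor arguments follow the public Pipe API of this period, so treat them as assumptions.

```python
# Sketch: split a sequential model into two partitions with an explicit balance.
# `balance` lists how many layers each partition gets; `chunks` is the number of
# micro-batches. Partitions are placed on the available devices (GPUs if present).
import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
)

pipe = Pipe(model, balance=[2, 2], chunks=4)
```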
-
Benjamin Lefaudeux authored
* unit test checking ddp and sharded_ddp equivalence, reproducing the issue that Sean spotted
* fixing the issue: requests in flight were not counted properly
* adding a multiple optimizers case
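The equivalence check described above amounts to training the same model under DDP and under OSS + ShardedDDP and asserting the losses match step for step. A hedged sketch of that comparison on a single rank; the module paths and the ShardedDDP constructor shown (module plus sharded optimizer) are assumptions based on the fairscale layout of this period.

```python
# Sketch: step-for-step loss parity between torch DDP and OSS + ShardedDDP.
# Single-rank gloo group so the snippet is self-contained.
import copy
import tempfile

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim.oss import OSS

dist.init_process_group(
    backend="gloo", init_method=f"file://{tempfile.mkstemp()[1]}", rank=0, world_size=1
)

torch.manual_seed(0)
reference = torch.nn.Linear(8, 1)
candidate = copy.deepcopy(reference)

ddp_model = DDP(reference)
ddp_optim = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

oss_optim = OSS(params=candidate.parameters(), optim=torch.optim.SGD, lr=0.1)
sharded_model = ShardedDDP(candidate, oss_optim)

for _ in range(3):
    batch, target = torch.randn(4, 8), torch.randn(4, 1)

    ddp_optim.zero_grad()
    ddp_loss = F.mse_loss(ddp_model(batch), target)
    ddp_loss.backward()
    ddp_optim.step()

    oss_optim.zero_grad()
    sharded_loss = F.mse_loss(sharded_model(batch), target)
    sharded_loss.backward()
    oss_optim.step()

    # the two data-parallel wrappers should produce the same loss at every step
    assert torch.allclose(ddp_loss, sharded_loss)

dist.destroy_process_group()
```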
-
- 09 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
-
- 07 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
* removing strict typing requirement, broken by ClassyVision
-
- 06 Dec, 2020 1 commit
-
-
Min Xu authored
-
- 05 Dec, 2020 1 commit
-
-
Benjamin Lefaudeux authored
Thanks Jessica for the heads up!
-