Commits · 44b9bcd811cb7f38980a265f53342bee2b602507 · OpenDAS / fairscale

19 Jan, 2021 1 commit

[refactor] Enable benchmarks/pipe.py and merge real and synthetic input pipeline. (#286) · 44b9bcd8

anj-s authored Jan 19, 2021



* [refactor]Remove unused variables and refactor common configurations

* move helper function to call site

* fixed lint errors

* fix lint errors

* fix lint errors

* fix lint errors

* fix import order

* format files

* remove unused imports

* fix lint errors

* fix lint errors

* refactor common utilities

* address PR comments

* sorted imports

* add space

* modify comment

* added doc strings and addressed PR comments.

* addressed PR comments

* added another comment to clarify.

* fixing lint errors

* addressed PR comments

* addressed PR comments

* fixed typos

* initialize var

* rename seq_pred to lm

* fix lint errors

* move datasets and models into separate folders

* add the folders created

* fix lint errors

* create golden config to stats mapping

* add common batching for both synthetic and real data

* fixed lint errors

* enable real pipe benchmarks with new golden data

* reduce seq len to avoid OOM

* updated golden data

* add logging

* add golden data

* add golden data

* fix lint errors

* add doc string

* remove commented out line

* address comments

* rename imports

* refactor common logic in dataloaders

* add golden configs

* lint changes
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>


16 Jan, 2021 1 commit
- [chore] update to torch v1.7.1 (#251) · 8d710c82
  msbaines authored Jan 15, 2021
  
15 Jan, 2021 2 commits
- [chore][ci] simplify torch installation (#312) · 9eeedda3
  msbaines authored Jan 15, 2021
  
- [feat][ShardedDDP] Support the original module's attributes (#309) · 3e2547c3
  Benjamin Lefaudeux authored Jan 15, 2021
```
* minor, but ease of life, one less papercut
```
12 Jan, 2021 1 commit
- [docs] clarify per-GPU batch size for AdaScale (#301) · 43a27cd4
  Min Xu authored Jan 11, 2021
```
- clarify that per-GPU batch size is not increased with AdaScale.
```
11 Jan, 2021 2 commits

[chore][ci] restore 1.5 & 1.6 tests and compatibility (#306) · 2d954203

Benjamin Lefaudeux authored Jan 11, 2021

* tentatively fixing the cpu version of circleci jobs, now pipe tests are the last ones standing
* fixing oss backcompat, trying to fix rpc in old pytorch also
* fixing the file based init in torch 1.5


[perf][OSS] tensor views for bucketing (#300) · 6219b57b

Benjamin Lefaudeux authored Jan 11, 2021

* min bucket size with model size
* resize the bucket after all the params have been squeezed in, save a tiny bit of memory
* minor, ensure that the cache is freed and improve the comments


08 Jan, 2021 5 commits
- [doc] Minor additions to ShardedDDP docs (#299) · b202804a
  Benjamin Lefaudeux authored Jan 08, 2021
  
- [perf][minor] ShardedDDP micro-optim (#296) · 11beea69
  Benjamin Lefaudeux authored Jan 08, 2021
```
* minor, not life changing but removing a dependency on runtime optim
```
- [refactor][OSS] Adding a pytorch parity unit test (#298) · 3d02f052
  Benjamin Lefaudeux authored Jan 08, 2021
```
* adding a parity unit test
* code review, better testing, use torch defaults and check for the loss, log world size
```
- [refactor][OSS] Removing ad-hoc object broadcast, use pytorch's (#297) · 3399e97c
  Benjamin Lefaudeux authored Jan 08, 2021
  
- [feat] Support model parallelism in OSS (#287) · 9faad392
  Joshua Meier authored Jan 08, 2021
```
* add additional unit test
* support model parallelism in oss
```
07 Jan, 2021 1 commit
- [fix] Adding missing CUDA files in the pip package v0.1.4 (#295) · 53a912c3
  Benjamin Lefaudeux authored Jan 07, 2021
```
* trying to fix the missing files in the pip package (not in this diff)
* adding a long description, more pypi friendly
```
05 Jan, 2021 2 commits

[fix] Flaky tests (#283) · 79365ee6

Benjamin Lefaudeux authored Jan 04, 2021

* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests


[chore] creating 0.1.3 to align numbering everywhere (#289) · 7cc8b34a
Benjamin Lefaudeux authored Jan 04, 2021
```
release pip package to follow suit
```

04 Jan, 2021 3 commits

[refactor] Modify train and benchmark functions to account for multiple models and datasets. (#260) · 656fc319

anj-s authored Jan 04, 2021



* [refactor]Remove unused variables and refactor common configurations

* move helper function to call site

* fixed lint errors

* fix lint errors

* fix lint errors

* fix lint errors

* fix import order

* format files

* remove unused imports

* fix lint errors

* fix lint errors

* refactor common utilities

* address PR comments

* sorted imports

* add space

* modify comment

* added doc strings and addressed PR comments.

* addressed PR comments

* added another comment to clarify.

* fixing lint errors

* addressed PR comments

* addressed PR comments

* fixed typos

* initialize var

* rename seq_pred to lm

* fix lint errors
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>


[chore] 0.1.2 version bump (#285) · a21f50f9
Benjamin Lefaudeux authored Jan 04, 2021


[feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6

Min Xu authored Jan 04, 2021

* [feat] sync adascale from internal repo

- tbd

testing: tbd

* Update argument document of __init__

* update documentation around set_num_gradients_to_accumulate

* added checking code for proper API calling places

* rename internal APIs to make them internal

* updated changelog

* added support for add_param_group and its unit test

* added unit test for set_num_gradients_to_accumulate

* added debias_ewma unit test

* fixed test_set_num_gradients_to_accumulate (need zero_grad() call)

* added missing zero_grad() to test_lr_scheduler

* fixed test_add_param_group with respect to optim.zero_grad()

* added test_gradient_value

* added test_scale_not_equal_default for scale != world_size * grad_accum

* added test_unhook()

* removed print statements

* fixed a typo

* addressed Ben's comment


02 Jan, 2021 1 commit
- [fix] Typo in ShardedDDP unit test (#282) · 84a3bdbe
  Benjamin Lefaudeux authored Jan 01, 2021
```
* fix typo, backend for CPU test
```
30 Dec, 2020 5 commits

[feat] Add Torch Sync Batchnorm handle in sharded DDP (#265) · 1c8d219d
Sean Naren authored Dec 30, 2020
```
* Add function to add handle for sync BN
* Add test to ensure batch norm handles have been added
```

[fix] regression testing oss+sharded_ddp only (#281) · fc1a40e1

Benjamin Lefaudeux authored Dec 29, 2020

- tighter regression detection, based on the best case vs. worst case
- still run all configurations, useful for comparisons but not a target


[refactor] Remove unused variables, add configuration objects and basic cleanup for pipe benchmarks. (#252) · 3c727ec5

anj-s authored Dec 29, 2020

* [refactor]Remove unused variables and refactor common configurations

* move helper function to call site

* fixed lint errors

* fix lint errors

* fix lint errors

* fix lint errors

* fix import order

* format files

* remove unused imports

* fix lint errors

* address PR comments

* sorted imports

* add space

* modify comment

* added doc strings and addressed PR comments.

* addressed PR comments

* added another comment to clarify.

* fixing lint errors

* rename variable
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>


[fix] Hopeful Circleci hangfix - teardown if raising exception (#280) · 8321f682
Benjamin Lefaudeux authored Dec 29, 2020
```
* timeout on the process join, expose a hanging process
* make sure that teardown is always called
```

[fix] Dead code removal for OSS (#276) · fb8d9137

Benjamin Lefaudeux authored Dec 29, 2020

* removing a dead call since ShardedDDP, small speedup
* unrelated, but filling in the changelog
* another nit


29 Dec, 2020 2 commits
- [hotfix] Catching properly a given test failing if not enough gpus (#274) · 7abaa2be
  Benjamin Lefaudeux authored Dec 28, 2020
```
* catching properly a given test failing if not enough gpus
```
- [feature] OSS: add unit test for distributed checkpointing (#273) · 60c8de4a
  Joshua Meier authored Dec 28, 2020
```
author: Joshua Meier
```
28 Dec, 2020 2 commits
- [chore] Move all unit tests dist init to being file based (#272) · b640cab5
  Benjamin Lefaudeux authored Dec 28, 2020
```
* file based dist init
* nicer handling of broken world sizes vs. number of available GPUs, do not break but warn out
```
- [doc] better ShardedGradScaler example (#271) · 290afecd
  Benjamin Lefaudeux authored Dec 27, 2020
  
24 Dec, 2020 1 commit

[chore] Update changelog (#268) · 18455bf0

Min Xu authored Dec 23, 2020

* Update changelog

missed this item from previous AdaScale commit.

* More change log

* Addressed review comments


22 Dec, 2020 2 commits

[fix] CircleCI vs pip hotfix (#267) · 381d28ca
Benjamin Lefaudeux authored Dec 22, 2020
```
* keep two torch 1.7 profiles to save cuda 10.1 testing
```

[OSS] Balance the trainable params only (#262) · c386e937

Benjamin Lefaudeux authored Dec 21, 2020

* fix, one liner

* adjust so that frozen trunks still get spread, even if this should have little consequence

* removing dead code, hopeful unit test fix

* now with some linting..

* adding a proper unit test case


19 Dec, 2020 1 commit

[OSS] Getting rid of the "should bucket" hash table, just use a list + non trainable params fix (#259) · ca74ee22

Benjamin Lefaudeux authored Dec 19, 2020

* Getting rid of the "should bucket" hash table, just use a list
Properly handle all params, with or without requires_grad

* make sure that this case is unit tested


17 Dec, 2020 3 commits
- [fix] grad scaler optional process group (#257) · bd7e25a5
  Benjamin Lefaudeux authored Dec 17, 2020
  
- [fix] OSS - resolve fp16 overflow in clip grad norm (#263) · 2df5ca2d
  Joshua Meier authored Dec 17, 2020
  
- [fix] OSS - typo + small perf fix (#256) · 2d9243bf
  Benjamin Lefaudeux authored Dec 16, 2020
```
* typo, sorry about that

* small perf fix
```
16 Dec, 2020 5 commits

[perf] ShardedDDP: better handling of the callback queue, try to consume it as we go. (#254) · 351f35e1
Benjamin Lefaudeux authored Dec 16, 2020
```
* Better handling of the callback queue, try to consume it as we go.

* dumping buckets for the reduce part, always the same unused params issue
```

[docs] lintfixes (#255) · 19cb5938

Benjamin Lefaudeux authored Dec 16, 2020



* lintfixes

* come on black

* Update tutorial_pipe_multiprocess.py

make RANK global like the other tutorials
Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>


[doc] Update README.md (#244) · 550f1ab7

VitaliyLi authored Dec 16, 2020



* Update README.md

* Update README.md

update capitalization
Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>


[feat] add CPU support to tutorials in examples + factorize tutorials (#247) · 02478eb3

jessijzhao authored Dec 15, 2020

* [feat] add CPU support to tutorials in examples

* now works on a machine without cuda
* fixes some minor typos

* [cleanup] factorize tutorials in examples

* collects duplicate code across tutorials in helpers.py

* [fix] getData in tutorials now returns iterable


[fix] solutions to recent pip's isolation failing to build from source (#249) · 7e5ddcd2
Stas Bekman authored Dec 15, 2020
