1. 01 Apr, 2021 1 commit
  2. 31 Mar, 2021 2 commits
  3. 18 Mar, 2021 1 commit
  4. 17 Mar, 2021 1 commit
  5. 12 Mar, 2021 1 commit
  6. 10 Mar, 2021 1 commit
  7. 09 Mar, 2021 1 commit
  8. 08 Mar, 2021 1 commit
  9. 05 Mar, 2021 1 commit
  10. 04 Mar, 2021 1 commit
  11. 03 Mar, 2021 1 commit
  12. 01 Mar, 2021 1 commit
    • [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all tests should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging
      
      * better capture the errors
      
      * debugging
      
      * fix pyenv command
      
      * add universe repo
      
      * update to cuda 11 for 171
      
      * add a test file, improved the checking script
      5eb6b8c7
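The commit above repeatedly rebalances test lists so the split CI runs finish in comparable time. A minimal sketch of that kind of greedy runtime balancing (a hypothetical helper, not fairscale's actual CI script):

```python
def balance_shards(test_times, num_shards):
    """Greedily assign test files to CI shards so total runtimes stay balanced.

    test_times: dict mapping test file name -> measured runtime in seconds.
    Returns a list of shards, each a list of test file names.
    """
    shards = [[] for _ in range(num_shards)]
    loads = [0.0] * num_shards
    # Place the longest tests first; always add to the currently lightest shard.
    for name, secs in sorted(test_times.items(), key=lambda kv: -kv[1]):
        idx = loads.index(min(loads))
        shards[idx].append(name)
        loads[idx] += secs
    return shards

times = {"test_a.py": 600, "test_b.py": 300, "test_c.py": 300, "test_d.py": 200}
print(balance_shards(times, 2))
```

With the sample times above, the two shards come out at 800s and 600s, which is as close as any 2-way split of those four files can get.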
  13. 26 Feb, 2021 1 commit
    • [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
      anj-s authored
      
      
      * clean start
      
      * removing per layer split strategy, probably not that useful indeed
      
      * initial transformer benchmark
      
      * hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
      
      * proper cuda streams and device, something off in terms of memory consumption
      
      * minor, stashing
      
      * unit test fix
      
      * removing all the distributed parts
      
      * simpler test, needs debugging
      
      * working OOP, running a model which does not fit in GPU memory
      
      * spring cleaning
      
      * removing the ill-advised optimizer bits, better keep that orthogonal
      
      * [offload] Add support for activation offloading + other changes (#367)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * avoid saving inputs
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for fp16 training (#374)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for activation checkpointing for all layers. (#381)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      
      * cp work, incorrect output dimensions still need to be fixed
      
      * fixed activation outputs
      
      * intermediate cp of work
      
      * add tests
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * add support for microbatches
      
      * revert benchmark config changes
      
      * add parametrization
      
      * fix lint errors and tests
      
      * skip test for 1.5
      
      * fix lint errors
      
      * skip test if there are no GPUs
      
      * fix lint errors
      
      * fix lint errors
      
      * move experimental to the fairscale repo
      
      * lint error fixes
      
      * modify test imports
      
      * lint error fixes
      
      * move offload files to the experimental directory
      
      * move tests and benchmarks to their folders
      
      * fix mypy errors
      
      * cp intermediate working benchmarks
      
      * more changes
      
      * split benchmark configs
      
      * remove print statements
      
      * fix lint errors
      
      * remove unused print
      
      * stress testing
      
      * remove unused file
      
      * change param name
      
      * lint fixes
      
      * move file to the right folder
      
      * offload_experimental
      
      * add doc string
      
      * add error message
      Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
      Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      f7813d6d
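The OffloadModel feature in this commit keeps model shards in CPU memory and pages one shard at a time onto the single GPU for compute. A pure-Python sketch of that paging loop (hypothetical and heavily simplified; the real implementation operates on torch modules and adds CUDA streams, fp16, microbatching, and activation checkpointing):

```python
class Shard:
    """One contiguous slice of a sequential model."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # the computation this slice performs
        self.device = "cpu"   # parameters live on the CPU between uses

def offload_forward(shards, x, compute_device="cuda"):
    """Run shards one at a time on the compute device, paging each back out."""
    for shard in shards:
        shard.device = compute_device  # stand-in for copying params to the GPU
        x = shard.fn(x)                # forward through this slice
        shard.device = "cpu"           # stand-in for freeing GPU memory
    return x

shards = [Shard("s0", lambda v: v + 1), Shard("s1", lambda v: v * 2)]
print(offload_forward(shards, 3))  # (3 + 1) * 2
```

At any moment only one shard's parameters occupy the compute device, which is what lets a model larger than GPU memory train on a single GPU, at the cost of extra host-device copies.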
  14. 24 Feb, 2021 2 commits
  15. 23 Feb, 2021 1 commit
  16. 04 Feb, 2021 1 commit
  17. 03 Feb, 2021 2 commits
  18. 29 Jan, 2021 1 commit
  19. 27 Jan, 2021 1 commit
  20. 25 Jan, 2021 1 commit
    • [refactor] Add benchmark config object and validation function (#314) · 331aed2c
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmarks with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove unused class
      
      * add seq len and batch size to the config
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      
      * merge latest changes
      
      * lint errors
      
      * address PR comments
      
      * initial refactoring
      
      * lint fixes
      
      * fix lint errors
      
      * update comment
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      331aed2c
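PR #314 above pairs a benchmark config object with a validation function that checks runs against "golden" reference stats. A minimal sketch of that pattern (class and field names are illustrative, not fairscale's actual ones):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkConfig:
    model: str
    batch_size: int
    seq_len: int

@dataclass(frozen=True)
class GoldenStats:
    min_throughput: float   # words/sec the benchmark must reach
    max_memory_gb: float    # peak memory it must stay under

# Map each golden config to the stats it is expected to reproduce.
GOLDEN = {
    BenchmarkConfig("lm", batch_size=32, seq_len=512): GoldenStats(25000.0, 12.0),
}

def validate(config, throughput, memory_gb):
    """Fail the benchmark run if it regresses against the golden data."""
    golden = GOLDEN[config]
    if throughput < golden.min_throughput:
        raise ValueError(f"throughput regression: {throughput} < {golden.min_throughput}")
    if memory_gb > golden.max_memory_gb:
        raise ValueError(f"memory regression: {memory_gb} > {golden.max_memory_gb}")
```

Keying the golden table on the full config is what makes "add seq len and batch size to the config" matter: the same model at a different batch size gets its own golden numbers.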
  21. 23 Jan, 2021 1 commit
  22. 21 Jan, 2021 2 commits
    • Benjamin Lefaudeux authored · 8a49a748
    • [refactor] Add batch size to the golden benchmark configs. (#313) · 81841734
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmarks with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove unused class
      
      * add seq len and batch size to the config
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      
      * merge latest changes
      
      * lint errors
      
      * address PR comments
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      81841734
  23. 19 Jan, 2021 1 commit
    • [refactor] Enable benchmarks/pipe.py and merge real and synthetic input pipeline. (#286) · 44b9bcd8
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmarks with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      44b9bcd8
  24. 04 Jan, 2021 1 commit
    • [refactor] Modify train and benchmark functions to account for multiple models and datasets. (#260) · 656fc319
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      656fc319
  25. 30 Dec, 2020 2 commits
    • [refactor] Remove unused variables, add configuration objects and basic cleanup for pipe benchmarks. (#252) · 3c727ec5
      anj-s authored
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * rename variable
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      3c727ec5
    • [fix] Dead code removal for OSS (#276) · fb8d9137
      Benjamin Lefaudeux authored
      * removing a dead call since ShardedDDP, small speedup
      * unrelated, but filling in the changelog
      * another nit
      fb8d9137
  26. 16 Dec, 2020 1 commit
  27. 01 Dec, 2020 1 commit
  28. 22 Nov, 2020 1 commit
  29. 21 Nov, 2020 1 commit
    • [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speed is likely still on the table, since the measured speed vs. bucketing does not match expectations; could be a follow-up
      ad933b34
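The commit above notes that ShardedDataParallel shares gradient buckets with OSS to avoid duplication. A sketch of size-capped gradient bucketing, the idea behind grouping gradients so each bucket is reduced in a single collective call (a hypothetical simplification, not fairscale's actual bucketing code):

```python
def bucket_grads(grad_sizes, bucket_cap_bytes):
    """Group consecutive gradients into buckets under a byte cap.

    grad_sizes: byte size of each gradient, in the order they become ready.
    Returns a list of buckets, each a list of gradient indices.
    """
    buckets, current, current_size = [], [], 0
    for idx, size in enumerate(grad_sizes):
        # Flush the open bucket when adding this gradient would exceed the cap.
        if current and current_size + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

print(bucket_grads([4, 4, 8, 2, 10], bucket_cap_bytes=10))
```

Fewer, larger reduce calls amortize collective-launch latency, which is why bucketing (and reusing the same buckets between OSS and ShardedDDP) is a speedup rather than just a memory tweak.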
  30. 19 Nov, 2020 2 commits
  31. 18 Nov, 2020 1 commit
  32. 16 Nov, 2020 1 commit
  33. 12 Nov, 2020 1 commit
  34. 10 Nov, 2020 1 commit
    • Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
      5d4f50fb
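Pipeline parallelism as described above partitions a sequential model across workers according to a balance list. A minimal sketch of that split (a hypothetical helper; fairscale's Pipe accepts a `balance` argument of this shape alongside the sequential model):

```python
def split_by_balance(layers, balance):
    """Partition an ordered list of layers into pipeline stages.

    balance[i] is how many consecutive layers stage i owns; the counts
    must cover the whole model exactly, so each stage gets a contiguous
    slice and the stage boundaries define where activations are shipped.
    """
    if sum(balance) != len(layers):
        raise ValueError("balance must sum to the number of layers")
    stages, start = [], 0
    for count in balance:
        stages.append(layers[start:start + count])
        start += count
    return stages

print(split_by_balance(["embed", "block1", "block2", "head"], [2, 2]))
```

Single-process control (the PipeRPCWrapper part) then means one rank owns the training loop and drives the other stages over RPC, rather than every rank running its own copy of the script.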