1. 01 Apr, 2021 1 commit
  2. 31 Mar, 2021 2 commits
  3. 18 Mar, 2021 1 commit
  4. 17 Mar, 2021 1 commit
  5. 12 Mar, 2021 1 commit
  6. 10 Mar, 2021 1 commit
  7. 09 Mar, 2021 1 commit
  8. 08 Mar, 2021 1 commit
  9. 05 Mar, 2021 1 commit
  10. 04 Mar, 2021 1 commit
  11. 03 Mar, 2021 1 commit
  12. 01 Mar, 2021 1 commit
    • [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all tests should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging
      
      * better capture the errors
      
      * debugging
      
      * fix pyenv command
      
      * add universe repo
      
      * update to cuda 11 for 171
      
      * add a test file, improved the checking script
      5eb6b8c7
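The commit above repeatedly rebalances test lists so the split CI runs finish in comparable time. A minimal sketch of that kind of greedy runtime balancing (a hypothetical helper, not fairscale's actual CI script):

```python
def balance_shards(test_times, num_shards):
    """Greedily assign test files to CI shards so total runtimes stay balanced.

    test_times: dict mapping test file name -> measured runtime in seconds.
    Returns a list of shards, each a list of test file names.
    """
    shards = [[] for _ in range(num_shards)]
    loads = [0.0] * num_shards
    # Place the longest tests first; always add to the currently lightest shard.
    for name, secs in sorted(test_times.items(), key=lambda kv: -kv[1]):
        idx = loads.index(min(loads))
        shards[idx].append(name)
        loads[idx] += secs
    return shards

times = {"test_a.py": 600, "test_b.py": 300, "test_c.py": 300, "test_d.py": 200}
print(balance_shards(times, 2))
```

With the sample times above, the two shards come out at 800s and 600s, which is as close as any 2-way split of those four files can get.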
  13. 26 Feb, 2021 1 commit
    • [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
      anj-s authored
      
      
      * clean start
      
      * removing per layer split strategy, probably not that useful indeed
      
      * initial transformer benchmark
      
      * hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
      
      * proper cuda streams and device, something off in terms of memory consumption
      
      * minor, stashing
      
      * unit test fix
      
      * removing all the distributed parts
      
      * simpler test, needs debugging
      
      * working OOP, running a model which does not fit in GPU memory
      
      * spring cleaning
      
      * removing the ill-advised optimizer bits, better keep that orthogonal
      
      * [offload] Add support for activation offloading + other changes (#367)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * avoid saving inputs
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for fp16 training (#374)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for activation checkpointing for all layers. (#381)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      
      * cp work, incorrect output dimensions still need to be fixed
      
      * fixed activation outputs
      
      * intermediate cp of work
      
      * add tests
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * add support for microbatches
      
      * revert benchmark config changes
      
      * add parametrization
      
      * fix lint errors and tests
      
      * skip test for 1.5
      
      * fix lint errors
      
      * skip test if there are no GPUs
      
      * fix lint errors
      
      * fix lint errors
      
      * move experimental to the fairscale repo
      
      * lint error fixes
      
      * modify test imports
      
      * lint error fixes
      
      * move offload files to the experimental directory
      
      * move tests and benchmarks to their folders
      
      * fix mypy errors
      
      * cp intermediate working benchmarks
      
      * more changes
      
      * split benchmark configs
      
      * remove print statements
      
      * fix lint errors
      
      * remove unused print
      
      * stress testing
      
      * remove unused file
      
      * change param name
      
      * lint fixes
      
      * move file to the right folder
      
      * offload_experimental
      
      * add doc string
      
      * add error message
      Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
      Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      f7813d6d
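The OffloadModel feature in this commit keeps model shards in CPU memory and pages one shard at a time onto the single GPU for compute. A pure-Python sketch of that paging loop (hypothetical and heavily simplified; the real implementation operates on torch modules and adds CUDA streams, fp16, microbatching, and activation checkpointing):

```python
class Shard:
    """One contiguous slice of a sequential model."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # the computation this slice performs
        self.device = "cpu"   # parameters live on the CPU between uses

def offload_forward(shards, x, compute_device="cuda"):
    """Run shards one at a time on the compute device, paging each back out."""
    for shard in shards:
        shard.device = compute_device  # stand-in for copying params to the GPU
        x = shard.fn(x)                # forward through this slice
        shard.device = "cpu"           # stand-in for freeing GPU memory
    return x

shards = [Shard("s0", lambda v: v + 1), Shard("s1", lambda v: v * 2)]
print(offload_forward(shards, 3))  # (3 + 1) * 2
```

At any moment only one shard's parameters occupy the compute device, which is what lets a model larger than GPU memory train on a single GPU, at the cost of extra host-device copies.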
  14. 24 Feb, 2021 2 commits
  15. 23 Feb, 2021 1 commit
  16. 04 Feb, 2021 1 commit
  17. 03 Feb, 2021 2 commits
  18. 29 Jan, 2021 1 commit
  19. 27 Jan, 2021 1 commit
  20. 25 Jan, 2021 1 commit
    • [refactor] Add benchmark config object and validation function (#314) · 331aed2c
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmarks with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove unused class
      
      * add seq len and batch size to the config
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      
      * merge latest changes
      
      * lint errors
      
      * address PR comments
      
      * initial refactoring
      
      * lint fixes
      
      * fix lint errors
      
      * update comment
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      331aed2c
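PR #314 above pairs a benchmark config object with a validation function that checks runs against "golden" reference stats. A minimal sketch of that pattern (class and field names are illustrative, not fairscale's actual ones):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkConfig:
    model: str
    batch_size: int
    seq_len: int

@dataclass(frozen=True)
class GoldenStats:
    min_throughput: float   # words/sec the benchmark must reach
    max_memory_gb: float    # peak memory it must stay under

# Map each golden config to the stats it is expected to reproduce.
GOLDEN = {
    BenchmarkConfig("lm", batch_size=32, seq_len=512): GoldenStats(25000.0, 12.0),
}

def validate(config, throughput, memory_gb):
    """Fail the benchmark run if it regresses against the golden data."""
    golden = GOLDEN[config]
    if throughput < golden.min_throughput:
        raise ValueError(f"throughput regression: {throughput} < {golden.min_throughput}")
    if memory_gb > golden.max_memory_gb:
        raise ValueError(f"memory regression: {memory_gb} > {golden.max_memory_gb}")
```

Keying the golden table on the full config is what makes "add seq len and batch size to the config" matter: the same model at a different batch size gets its own golden numbers.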
  21. 23 Jan, 2021 1 commit
  22. 21 Jan, 2021 2 commits
    • Benjamin Lefaudeux authored · 8a49a748
    • [refactor] Add batch size to the golden benchmark configs. (#313) · 81841734
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmarks with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove unused class
      
      * add seq len and batch size to the config
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      
      * merge latest changes
      
      * lint errors
      
      * address PR comments
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      81841734
  23. 19 Jan, 2021 1 commit
    • [refactor] Enable benchmarks/pipe.py and merge real and synthetic input pipeline. (#286) · 44b9bcd8
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmarks with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      44b9bcd8
  24. 04 Jan, 2021 1 commit
    • [refactor] Modify train and benchmark functions to account for multiple models and datasets. (#260) · 656fc319
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      656fc319
  25. 30 Dec, 2020 2 commits
    • [refactor] Remove unused variables, add configuration objects and basic cleanup for pipe benchmarks. (#252) · 3c727ec5
      anj-s authored
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * rename variable
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      3c727ec5
    • [fix] Dead code removal for OSS (#276) · fb8d9137
      Benjamin Lefaudeux authored
      * removing a dead call since ShardedDDP, small speedup
      * unrelated, but filling in the changelog
      * another nit
      fb8d9137
  26. 16 Dec, 2020 1 commit
  27. 01 Dec, 2020 1 commit
  28. 22 Nov, 2020 1 commit
  29. 21 Nov, 2020 1 commit
    • [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speed is likely still on the table, since the measured speed vs. bucketing does not match expectations; could be a follow-up
      ad933b34
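The commit above notes that ShardedDataParallel shares gradient buckets with OSS to avoid duplication. A sketch of size-capped gradient bucketing, the idea behind grouping gradients so each bucket is reduced in a single collective call (a hypothetical simplification, not fairscale's actual bucketing code):

```python
def bucket_grads(grad_sizes, bucket_cap_bytes):
    """Group consecutive gradients into buckets under a byte cap.

    grad_sizes: byte size of each gradient, in the order they become ready.
    Returns a list of buckets, each a list of gradient indices.
    """
    buckets, current, current_size = [], [], 0
    for idx, size in enumerate(grad_sizes):
        # Flush the open bucket when adding this gradient would exceed the cap.
        if current and current_size + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

print(bucket_grads([4, 4, 8, 2, 10], bucket_cap_bytes=10))
```

Fewer, larger reduce calls amortize collective-launch latency, which is why bucketing (and reusing the same buckets between OSS and ShardedDDP) is a speedup rather than just a memory tweak.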
  30. 19 Nov, 2020 2 commits
  31. 18 Nov, 2020 1 commit
  32. 16 Nov, 2020 1 commit
  33. 12 Nov, 2020 1 commit
  34. 10 Nov, 2020 1 commit
    • Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
      5d4f50fb
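Pipeline parallelism as described above partitions a sequential model across workers according to a balance list. A minimal sketch of that split (a hypothetical helper; fairscale's Pipe accepts a `balance` argument of this shape alongside the sequential model):

```python
def split_by_balance(layers, balance):
    """Partition an ordered list of layers into pipeline stages.

    balance[i] is how many consecutive layers stage i owns; the counts
    must cover the whole model exactly, so each stage gets a contiguous
    slice and the stage boundaries define where activations are shipped.
    """
    if sum(balance) != len(layers):
        raise ValueError("balance must sum to the number of layers")
    stages, start = [], 0
    for count in balance:
        stages.append(layers[start:start + count])
        start += count
    return stages

print(split_by_balance(["embed", "block1", "block2", "head"], [2, 2]))
```

Single-process control (the PipeRPCWrapper part) then means one rank owns the training loop and drives the other stages over RPC, rather than every rank running its own copy of the script.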