1. 29 Mar, 2021 1 commit
  2. 28 Mar, 2021 1 commit
  3. 19 Mar, 2021 2 commits
  4. 04 Mar, 2021 1 commit
  5. 01 Mar, 2021 1 commit
    • Min Xu's avatar
      [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all test should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging...
      5eb6b8c7
  6. 26 Feb, 2021 1 commit
    • anj-s's avatar
      [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
      anj-s authored
      
      
      * clean start
      
      * removing per layer split strategy, probably not that useful indeed
      
      * initial transformer benchmark
      
      * hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
      
      * proper cuda streams and device, something off in terms of mems consumption
      
      * minor, stashing
      
      * unit test fix
      
      * removing all the distributed parts
      
      * simpler test, needs debugging
      
      * working OOP, running a model which does not fit on the gpu memory
      
      * spring cleaning
      
      * removing the ill-advised optimizer bits, better keep that orthogonal
      
      * [offload] Add support for activation offloading + other changes (#367)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * avoid saving inputs
      
      * fix lint errors
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for fp16 training (#374)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for activation checkpointing for all layers. (#381)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      
      * cp work, incorrect output dimensions still need to be fixed
      
      * fixed activation outputs
      
      * intermediate cp of work
      
      * add tests
      
      * fix lint errors
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * add support for microbatches
      
      * revert benchmark config changes
      
      * add parametrization
      
      * fix lint errors and tests
      
      * skip test for 1.5
      
      * fix lint errors
      
      * skip test if there are no GPUs
      
      * fix lint errors
      
      * fix lint errors
      
      * move experimental to the fairscale repo
      
      * lint error fixes
      
      * modify test imports
      
      * lint error fixes
      
      * move offload files to the experimental directory
      
      * move tests and benchmarks to their forlder
      
      * fix mypy errors
      
      * cp intermediate working benchmarks
      
      * more changes
      
      * split benchmark configs
      
      * remove print statements
      
      * fix lint errors
      
      * remove unused print
      
      * stress testing
      
      * remove unused file
      
      * change param nae
      
      * lint fixes
      
      * move file to the right folder
      
      * offload_experimental
      
      * add doc string
      
      * add error message
      Co-authored-by: default avatarBenjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
      Co-authored-by: default avatarBenjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      f7813d6d
  7. 24 Feb, 2021 1 commit