1. 01 Mar, 2021 3 commits
    • Min Xu's avatar
      [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all test should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging...
      5eb6b8c7
    • Min Xu's avatar
      [test] FSDP: add the failing test for #421 (#453) · 5ecac15a
      Min Xu authored
      
      
      * [test] FSDP: add the failing test for #421
      
      * skip on 1.5
      
      * better skipping
      
      * Update tests/nn/data_parallel/test_fsdp_grad_scaler.py
      Co-authored-by: default avatarSam Shleifer <sshleifer@gmail.com>
      Co-authored-by: default avatarSam Shleifer <sshleifer@gmail.com>
      5ecac15a
    • Sean Naren's avatar
  2. 27 Feb, 2021 3 commits
  3. 26 Feb, 2021 7 commits
    • Myle Ott's avatar
    • Min Xu's avatar
      update readme (#439) · 93d115c6
      Min Xu authored
      93d115c6
    • Myle Ott's avatar
    • Vittorio Caggiano's avatar
      Update README.md (#438) · 9163e381
      Vittorio Caggiano authored
      * Update README.md
      9163e381
    • Min Xu's avatar
      parallelize tests on GPUs (#436) · 2fa0b1e7
      Min Xu authored
      2fa0b1e7
    • Min Xu's avatar
      [feat]: add summon_full_params context mgr (#433) · 77f92b38
      Min Xu authored
      * [feat]: add summon_full_params context mgr
      
      * fix
      
      * fix
      
      * addressed comments
      
      * fixed the state_dict copy
      
      * lint
      77f92b38
    • anj-s's avatar
      [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
      anj-s authored
      
      
      * clean start
      
      * removing per layer split strategy, probably not that useful indeed
      
      * initial transformer benchmark
      
      * hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
      
      * proper cuda streams and device, something off in terms of mems consumption
      
      * minor, stashing
      
      * unit test fix
      
      * removing all the distributed parts
      
      * simpler test, needs debugging
      
      * working OOP, running a model which does not fit on the gpu memory
      
      * spring cleaning
      
      * removing the ill-advised optimizer bits, better keep that orthogonal
      
      * [offload] Add support for activation offloading + other changes (#367)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * avoid saving inputs
      
      * fix lint errors
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for fp16 training (#374)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for activation checkpointing for all layers. (#381)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      
      * cp work, incorrect output dimensions still need to be fixed
      
      * fixed activation outputs
      
      * intermediate cp of work
      
      * add tests
      
      * fix lint errors
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * add support for microbatches
      
      * revert benchmark config changes
      
      * add parametrization
      
      * fix lint errors and tests
      
      * skip test for 1.5
      
      * fix lint errors
      
      * skip test if there are no GPUs
      
      * fix lint errors
      
      * fix lint errors
      
      * move experimental to the fairscale repo
      
      * lint error fixes
      
      * modify test imports
      
      * lint error fixes
      
      * move offload files to the experimental directory
      
      * move tests and benchmarks to their forlder
      
      * fix mypy errors
      
      * cp intermediate working benchmarks
      
      * more changes
      
      * split benchmark configs
      
      * remove print statements
      
      * fix lint errors
      
      * remove unused print
      
      * stress testing
      
      * remove unused file
      
      * change param nae
      
      * lint fixes
      
      * move file to the right folder
      
      * offload_experimental
      
      * add doc string
      
      * add error message
      Co-authored-by: default avatarBenjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
      Co-authored-by: default avatarBenjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      f7813d6d
  4. 25 Feb, 2021 3 commits
  5. 24 Feb, 2021 4 commits
  6. 23 Feb, 2021 11 commits
  7. 22 Feb, 2021 1 commit
  8. 19 Feb, 2021 4 commits
  9. 18 Feb, 2021 3 commits
  10. 17 Feb, 2021 1 commit