    [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
    anj-s authored
    
    
    * clean start
    
    * removing the per-layer split strategy, probably not that useful
    
    * initial transformer benchmark
    
    * hack: enable testing ViT + offload via python3 benchmarks/oss.py --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
    
    * proper CUDA streams and device, something off in terms of memory consumption
    
    * minor, stashing
    
    * unit test fix
    
    * removing all the distributed parts
    
    * simpler test, needs debugging
    
    * working OOP, running a model which does not fit in GPU memory
    
    * spring cleaning
    
    * removing the ill-advised optimizer bits, better keep that orthogonal
    
    * [offload] Add support for activation offloading + other changes (#367)
    
    * initial fwd/bwd commit
    
    * checkpoint work
    
    * modify shard loop
    
    * activation offloading and test to start with
    
    * fix lint errors
    
    * update comments
    
    * fix lint
    
    * remove unused var
    
    * remove commented out lines
    
    * modify name
    
    * remove break
    
    * remove profiler comments
    
    * avoid saving inputs
    
    * fix lint errors
    Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
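
    The OffloadModel wrapper added by this PR, together with the activation offloading from #367, is used roughly as follows. This is a minimal sketch assuming the constructor keywords model, device, offload_device, num_slices and checkpoint_activation exposed by fairscale.experimental.nn.offload; verify them against your fairscale version.

        # Minimal sketch (assumed API): wrap an nn.Sequential so that layer shards
        # are stored on CPU and copied to the GPU one shard at a time during
        # forward/backward; checkpoint_activation additionally offloads activations.
        import torch
        import torch.nn as nn
        from fairscale.experimental.nn.offload import OffloadModel

        model = nn.Sequential(
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, 10),
        )

        offload_model = OffloadModel(
            model=model,                         # sequential model to shard
            device=torch.device("cuda"),         # device that runs the compute
            offload_device=torch.device("cpu"),  # device that stores idle shards
            num_slices=3,                        # number of layer shards
            checkpoint_activation=True,          # offload/checkpoint activations (#367)
        )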
    
    * [offload] Add support for fp16 training (#374)
    
    * initial fwd/bwd commit
    
    * checkpoint work
    
    * modify shard loop
    
    * activation offloading and test to start with
    
    * fix lint errors
    
    * update comments
    
    * fix lint
    
    * remove unused var
    
    * remove commented out lines
    
    * modify name
    
    * remove break
    
    * remove profiler comments
    
    * add support for fp16
    
    * add unit tests
    
    * fix lint errors
    
    * fix test failure
    Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
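
    One plausible way to drive the fp16 training support from #374 is PyTorch AMP around the offloaded model. This is an illustrative sketch only (the PR's internal fp16 handling in offload.py may differ); offload_model and dataloader are assumed to exist as in the previous snippet.

        # Sketch: mixed-precision training loop around an OffloadModel instance.
        import torch
        import torch.nn.functional as F

        optimizer = torch.optim.SGD(offload_model.parameters(), lr=1e-3)
        scaler = torch.cuda.amp.GradScaler()

        for inputs, targets in dataloader:
            inputs, targets = inputs.cuda(), targets.cuda()
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():   # run the forward in fp16 where safe
                loss = F.cross_entropy(offload_model(inputs), targets)
            scaler.scale(loss).backward()     # scaled backward through the shards
            scaler.step(optimizer)
            scaler.update()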
    
    * [offload] Add support for activation checkpointing for all layers. (#381)
    
    * initial fwd/bwd commit
    
    * checkpoint work
    
    * modify shard loop
    
    * activation offloading and test to start with
    
    * fix lint errors
    
    * update comments
    
    * fix lint
    
    * remove unused var
    
    * remove commented out lines
    
    * modify name
    
    * remove break
    
    * remove profiler comments
    
    * add support for fp16
    
    * add unit tests
    
    * fix lint errors
    
    * fix test failure
    
    * checkpoint work, incorrect output dimensions still need to be fixed
    
    * fixed activation outputs
    
    * intermediate checkpoint of work
    
    * add tests
    
    * fix lint errors
    Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
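
    #381 extends activation checkpointing to every layer shard. For reference, the underlying technique is the one provided by torch.utils.checkpoint: activations are dropped after the forward pass and recomputed during backward, trading compute for memory. The snippet below only illustrates that general mechanism, not the PR's internal implementation in offload.py.

        # General activation-checkpointing technique (torch.utils.checkpoint),
        # shown on a single shard-like block.
        import torch
        import torch.nn as nn
        from torch.utils.checkpoint import checkpoint

        shard = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
        x = torch.randn(8, 1024, requires_grad=True)

        y = checkpoint(shard, x)   # intermediate activations recomputed on backward
        y.sum().backward()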
    
    * add support for microbatches
    
    * revert benchmark config changes
    
    * add parametrization
    
    * fix lint errors and tests
    
    * skip test for 1.5
    
    * fix lint errors
    
    * skip test if there are no GPUs
    
    * fix lint errors
    
    * fix lint errors
    
    * move experimental to the fairscale repo
    
    * lint error fixes
    
    * modify test imports
    
    * lint error fixes
    
    * move offload files to the experimental directory
    
    * move tests and benchmarks to their folders
    
    * fix mypy errors
    
    * checkpoint intermediate working benchmarks
    
    * more changes
    
    * split benchmark configs
    
    * remove print statements
    
    * fix lint errors
    
    * remove unused print
    
    * stress testing
    
    * remove unused file
    
    * change param name
    
    * lint fixes
    
    * move file to the right folder
    
    * offload_experimental
    
    * add doc string
    
    * add error message
    Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
    Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
    Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
offload.py 15.1 KB
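
Putting the pieces above together (layer sharding, the all-layer activation checkpointing from #381, and the "add support for microbatches" commit), here is a rough end-to-end sketch of what offload.py exposes. The keyword names, in particular num_microbatches, are taken from the feature list above and should be treated as assumptions to check against your fairscale version.

    # End-to-end sketch (assumed API): shard the model across CPU/GPU,
    # checkpoint activations for every shard, and split each batch into
    # microbatches to reduce peak GPU memory.
    import torch
    import torch.nn as nn
    from fairscale.experimental.nn.offload import OffloadModel

    model = nn.Sequential(*[nn.Linear(2048, 2048) for _ in range(24)])

    offload_model = OffloadModel(
        model=model,
        device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        num_slices=6,                # six layer shards swapped on and off the GPU
        checkpoint_activation=True,  # activation checkpointing for all layers (#381)
        num_microbatches=4,          # split each batch into 4 microbatches
    )

    optimizer = torch.optim.SGD(offload_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    inputs = torch.randn(32, 2048).cuda()
    targets = torch.randint(0, 2048, (32,)).cuda()

    optimizer.zero_grad()
    loss = criterion(offload_model(inputs), targets)
    loss.backward()
    optimizer.step()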