1. 08 May, 2021 1 commit
  2. 05 Apr, 2021 1 commit
  3. 17 Mar, 2021 1 commit
  4. 04 Mar, 2021 1 commit
  5. 26 Feb, 2021 1 commit
    • anj-s's avatar
      [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
      anj-s authored
      
      
      * clean start
      
      * removing per layer split strategy, probably not that useful indeed
      
      * initial transformer benchmark
      
      * hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
      
      * proper cuda streams and device, something off in terms of mems consumption
      
      * minor, stashing
      
      * unit test fix
      
      * removing all the distributed parts
      
      * simpler test, needs debugging
      
      * working OOP, running a model which does not fit on the gpu memory
      
      * spring cleaning
      
      * removing the ill-advised optimizer bits, better keep that orthogonal
      
      * [offload] Add support for activation offloading + other changes (#367)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * avoid saving inputs
      
      * fix lint errors
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for fp16 training (#374)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for activation checkpointing for all layers. (#381)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      
      * cp work, incorrect output dimensions still need to be fixed
      
      * fixed activation outputs
      
      * intermediate cp of work
      
      * add tests
      
      * fix lint errors
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      
      * add support for microbatches
      
      * revert benchmark config changes
      
      * add parametrization
      
      * fix lint errors and tests
      
      * skip test for 1.5
      
      * fix lint errors
      
      * skip test if there are no GPUs
      
      * fix lint errors
      
      * fix lint errors
      
      * move experimental to the fairscale repo
      
      * lint error fixes
      
      * modify test imports
      
      * lint error fixes
      
      * move offload files to the experimental directory
      
      * move tests and benchmarks to their forlder
      
      * fix mypy errors
      
      * cp intermediate working benchmarks
      
      * more changes
      
      * split benchmark configs
      
      * remove print statements
      
      * fix lint errors
      
      * remove unused print
      
      * stress testing
      
      * remove unused file
      
      * change param nae
      
      * lint fixes
      
      * move file to the right folder
      
      * offload_experimental
      
      * add doc string
      
      * add error message
      Co-authored-by: default avatarBenjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
      Co-authored-by: default avatarBenjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      f7813d6d
  6. 24 Feb, 2021 1 commit
  7. 03 Feb, 2021 1 commit
    • anj-s's avatar
      [refactor] Refactor and enable multiprocess nn.Pipe benchmarks. (#319) · cd186441
      anj-s authored
      
      
      * mp cleanup
      
      * round of multiprocess refactoring
      
      * test golden run
      
      * print cuda stats
      
      * fix lint errors
      
      * enable multiprocess pipe benchmarks
      
      * set world size to be available gpus
      
      * more changes
      
      * use synthetic loaders for intermediate pipeline stages
      
      * merged master
      
      * fix for the devices property
      
      * dataloader fix
      
      * modify rank check
      
      * print wps stats
      
      * enable verification
      
      * fix logging
      
      * fix flag name
      
      * fix flag name
      
      * check for rank
      
      * fix indent
      
      * pass args
      
      * pass args
      
      * modify golden data
      
      * remove unused print messsage
      
      * fix lint errors
      
      * add comments
      
      * fix benchmarks
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      cd186441
  8. 21 Jan, 2021 1 commit
    • anj-s's avatar
      [refactor] Add batch size to the golden benchmark configs. (#313) · 81841734
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmakrs with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove unused class
      
      * add seq len and batch size to the config
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      
      * merge latest changes
      
      * lint errors
      
      * address PR comments
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      81841734
  9. 19 Jan, 2021 1 commit
    • anj-s's avatar
      [refactor] Enable benchmarks/pipe.py and merge real and synthetic input pipeline. (#286) · 44b9bcd8
      anj-s authored
      
      
      * [refactor]Remove unused variables and refactor common configurations
      
      * move helper function to call site
      
      * fixed lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * fix import order
      
      * format files
      
      * remove unused imports
      
      * fix lint errors
      
      * fix lint errors
      
      * refactor common utilities
      
      * address PR comments
      
      * sorted imports
      
      * add space
      
      * modify comment
      
      * added doc strings and addressed PR comments.
      
      * addressed PR comments
      
      * added another comment to clarify.
      
      * fixing lint errors
      
      * addressed PR comments
      
      * addressed PR comments
      
      * fixed typos
      
      * initialize var
      
      * rename seq_pred to lm
      
      * fix lint errors
      
      * move datasets and models into separate folders
      
      * add the folders created
      
      * fix lint errors
      
      * create golden config to stats mapping
      
      * add common batching for both synthetic and real data
      
      * fixed lint errors
      
      * enable real pipe benchmakrs with new golden data
      
      * reduce seq len to avoid OOM
      
      * updated golden data
      
      * add logging
      
      * add golden data
      
      * add golden data
      
      * fix lint errors
      
      * add doc string
      
      * remove commented out line
      
      * address comments
      
      * rename imports
      
      * refactor common logic in dataloaders
      
      * add golden configs
      
      * lint changes
      Co-authored-by: default avatarAnjali Sridhar <anj@devfair0443.h2.fair>
      44b9bcd8