1. 22 Nov, 2020 1 commit
  2. 21 Nov, 2020 1 commit
    • [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speedup is likely still on the table, since the measured speed vs. bucketing does not match expectations; could be a follow-up
      ad933b34
  3. 19 Nov, 2020 2 commits
  4. 18 Nov, 2020 1 commit
  5. 16 Nov, 2020 1 commit
  6. 12 Nov, 2020 1 commit
  7. 10 Nov, 2020 1 commit
    • Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
      5d4f50fb
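The per-rank event loop described in this commit can be pictured with a minimal sketch. This is not the fairscale implementation: the function name `run_event_loop`, the message tuples, and the `"stop"` sentinel are all hypothetical, chosen only to illustrate a worker that handles activations or gradients in whatever order they arrive.

```python
import queue
from typing import List, Tuple


def run_event_loop(inbox: "queue.Queue[Tuple[str, int]]") -> List[str]:
    """Process messages in arrival order until a ("stop", _) message is seen.

    Hypothetical message format: (kind, microbatch_index), where kind is
    "activation", "gradient" or "stop". Returns a log of the work performed.
    """
    log: List[str] = []
    while True:
        kind, microbatch = inbox.get()  # blocks until a message arrives
        if kind == "stop":
            break
        if kind == "activation":
            # An activation from the previous stage triggers a forward step.
            log.append(f"forward {microbatch}")
        elif kind == "gradient":
            # A gradient from the next stage triggers a backward step.
            log.append(f"backward {microbatch}")
        else:
            raise ValueError(f"unknown message kind: {kind!r}")
    return log


# Usage: messages are handled strictly in the order they were queued.
inbox: "queue.Queue[Tuple[str, int]]" = queue.Queue()
for message in [("activation", 0), ("activation", 1), ("gradient", 0), ("stop", -1)]:
    inbox.put(message)
print(run_event_loop(inbox))
```

The point of the design is that the worker does not follow a fixed forward-then-backward schedule; it reacts to whichever kind of message shows up next.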
  8. 06 Nov, 2020 1 commit
  9. 28 Oct, 2020 1 commit
  10. 23 Oct, 2020 1 commit
  11. 21 Oct, 2020 1 commit
  12. 20 Oct, 2020 1 commit
  13. 17 Oct, 2020 1 commit
  14. 14 Oct, 2020 1 commit
  15. 10 Oct, 2020 1 commit
  16. 09 Oct, 2020 1 commit
  17. 06 Oct, 2020 1 commit
    • [feat] OSS/SDP : bucketing (#122) · 341d8b2b
      Benjamin Lefaudeux authored
      Same bucketing strategy for OSS and SDP:
      Sort everything ahead of time, per rank and per size, smaller tensors first. Bucket the smallest tensors in a fixed buffer and send it asynchronously, then send all the other tensors asynchronously, and return to the bucket. Once done, scatter the bucket contents if needed.
      341d8b2b
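The planning step of this bucketing strategy can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `plan_bucketing`, its arguments, and the idea of measuring sizes in element counts are hypothetical, and the actual asynchronous sends are left out.

```python
from typing import Dict, List, Tuple


def plan_bucketing(sizes: Dict[str, int], bucket_capacity: int) -> Tuple[List[str], List[str]]:
    """Decide, for one destination rank, which tensors go into the fixed bucket.

    sizes maps a (hypothetical) tensor name to its number of elements.
    Returns (bucketed, direct): names packed into the bucket and names
    sent individually, both ordered smallest first.
    """
    # Sort ahead of time, smaller tensors first (name breaks ties deterministically).
    ordered = sorted(sizes, key=lambda name: (sizes[name], name))

    bucketed: List[str] = []
    direct: List[str] = []
    used = 0
    for name in ordered:
        if used + sizes[name] <= bucket_capacity:
            # Small tensors share one buffer, so they cost a single send.
            bucketed.append(name)
            used += sizes[name]
        else:
            # Larger tensors are sent on their own.
            direct.append(name)
    return bucketed, direct


# Usage: with room for 16 elements, only the two smallest tensors are bucketed.
bucketed, direct = plan_bucketing(
    {"bias": 4, "weight": 100, "norm": 8, "embed": 1000}, bucket_capacity=16
)
print(bucketed, direct)
```

In a setup like the one the commit describes, this plan would be computed once per rank, before any communication starts.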
  18. 29 Sep, 2020 1 commit
  19. 24 Sep, 2020 1 commit
  20. 22 Sep, 2020 2 commits
  21. 17 Sep, 2020 2 commits
    • Multi-process pipe (#90) · 63f7796a
      Tom Birch authored
      Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
      * Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
      * Added support for lazy construction of modules (see lazy_construction for an example)
      * Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
      * Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
      63f7796a
    • [feat] Sharded DDP - small refactor and new features (#97) · 49a198c9
      Benjamin Lefaudeux authored
      - rename oss_ddp to ShardedDataParallel
      - some refactoring
      - ShardedDataParallel owns the sharded optimizer, exposed if need be
      - some small perf bumps
      49a198c9
  22. 16 Sep, 2020 1 commit
  23. 09 Sep, 2020 1 commit
    • [feat] OSS flatten state dict (#65) · 4f597233
      Benjamin Lefaudeux authored
      Changes the structure of the returned state dict with respect to the param_groups, un-sharding them so the result is closer to what a vanilla optimizer would return. The param_groups are sharded again when loading.
      4f597233
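The un-shard on save, re-shard on load idea can be illustrated with a small round trip. This is a sketch under assumptions and not fairscale's actual layout: the helper names and the `"partition"` bookkeeping key are hypothetical.

```python
from typing import Any, Dict, List

ParamGroup = Dict[str, Any]


def flatten_param_groups(per_rank_groups: List[List[ParamGroup]]) -> Dict[str, Any]:
    """Merge each rank's param_groups into one flat list, as a vanilla optimizer would expose."""
    flat: List[ParamGroup] = []
    partition: List[int] = []  # hypothetical bookkeeping: group count per rank
    for groups in per_rank_groups:
        partition.append(len(groups))
        flat.extend(groups)
    return {"param_groups": flat, "partition": partition}


def reshard_param_groups(state: Dict[str, Any]) -> List[List[ParamGroup]]:
    """Split the flat list back into per-rank shards using the recorded partition."""
    shards: List[List[ParamGroup]] = []
    start = 0
    for count in state["partition"]:
        shards.append(state["param_groups"][start:start + count])
        start += count
    return shards


# Usage: two ranks, holding one and two param groups respectively.
per_rank = [[{"lr": 0.1, "params": [0]}], [{"lr": 0.1, "params": [1]}, {"lr": 0.01, "params": [2]}]]
saved = flatten_param_groups(per_rank)
print(len(saved["param_groups"]))
print(reshard_param_groups(saved) == per_rank)
```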
  24. 03 Sep, 2020 3 commits
    • [feat] Add a memory usage regression test to the OSS benchmark (#62) · ee38e1e0
      Benjamin Lefaudeux authored
      * Aligning the optimizer state dict with what PyTorch expects
      
      * Adding a check on the dict keys, ensure that `state` and `param_groups` are there
      
      * after installing the specific isort, black and all, one liner to please the linter..
      
      * Adding some measurement of the memory consumption while training + checkpointing
      
      * mandatory lintfix commit
      
      * brainfart, reset the memory use counter at the beginning of the training in case two of them are run in a row
      
      * move reset stats call, hotfix
      
      * move the optimizer to rmsprop, more stateful and still used in CV
      
      * trying to figure out a SIGSEGV in CircleCI
      ee38e1e0
    • Add grad scaler (#48) · b6a5e634
      Jun Ru Anderson authored
      
      
      Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require gradient scaling in order to converge.
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
      b6a5e634
    • [fix] OSS pytorch-compliant state dict (#61) · 1d1d15ea
      Benjamin Lefaudeux authored
      * Aligning the optimizer state dict with what PyTorch expects
      
      * Adding a check on the dict keys, ensure that `state` and `param_groups` are there
      
      * after installing the specific isort, black and all, one liner to please the linter..
      1d1d15ea
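A PyTorch optimizer's `state_dict()` returns a dictionary with the top-level keys `state` and `param_groups`, which is what the key check mentioned above guards. A minimal sketch of such a check follows; the function name `check_optimizer_state_dict` is hypothetical and this is not the test added by the commit.

```python
from typing import Any, Dict

REQUIRED_KEYS = ("state", "param_groups")


def check_optimizer_state_dict(state_dict: Dict[str, Any]) -> None:
    """Raise KeyError if a top-level key that PyTorch optimizers expose is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in state_dict]
    if missing:
        raise KeyError(f"optimizer state dict is missing keys: {missing}")


# Usage: a well-formed dict passes silently, a malformed one raises.
check_optimizer_state_dict({"state": {}, "param_groups": []})
try:
    check_optimizer_state_dict({"state": {}})
except KeyError as err:
    print(err)
```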
  25. 28 Aug, 2020 1 commit
    • [test] specify chunks for pipe/transformer benchmark (#52) · d1d74413
      Jun Ru Anderson authored
      
      
      * specify chunks for pipe/transformer benchmark
      
      Set chunks to be equal to len(balance) for pipe/transformer benchmark. Will update words per second and memory usage checks in next commit (must test on CircleCI to find appropriate values)
      
      * change benchmark words per second and memory usage
      
      Did six runs for words-per-second, with results: 9144.40, 9163.91, 9993.01, 9082.82, 9155.09, 9000.67
      Peak allocated bytes per device (which does not change between runs) were 193206272, 645632, 562688, 92688384 for devices 0, 1, 2 and 3, respectively
      
      * increase batch size
      
      batch size was small enough that the GPU's computing power was not the bottleneck, slowing training and specifically making more chunks slower. Increasing batch size has therefore increased training speed
      
      * update benchmark numbers
      
      ran six times, with wps 36917.44, 36797.65, 37006.03, 36872.84, 37129.31, 37003.31 and peak allocated bytes 4061909504, 4050944, 10427392, 2031824896 for devices 0, 1, 2 and 3, respectively.
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
      d1d74413
  26. 22 Aug, 2020 1 commit
  27. 21 Aug, 2020 2 commits
    • [feat] Simple macro OSS benchmark (#47) · 46c3776b
      Benjamin Lefaudeux authored
      
      
      * initial commit, dummy training loop, pure pytorch but not DDP
      
      * probably slightly broken, but rough DDP benchmark run
      
      * adding the torchvision requirement for testing
      
      * brainfart
      
      * reduce the loss, do something slightly distributed
      
      * Some cleanup, distributing the training on two GPUs
      
      * some cleanup + adding a vanilla run, still not good to go
      
      * less silly defaults, gtg for a start I think
      
      * smaller batch to fit the smaller gpus used in the circleci rigs
      
      * Adding some options for the benchmark, and regression testing
      
      * [test] set torch seed for Adam tests (#49)
      
      Set the torch seed for tests. xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict.
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
      
      * linting, I really need to automate this isort insanity
      Co-authored-by: Jun Ru Anderson <33384298+andersonic@users.noreply.github.com>
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
      46c3776b
    • [test] set torch seed for Adam tests (#49) · 0e8c2a96
      Jun Ru Anderson authored
      
      
      Set the torch seed for tests. xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict.
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
      0e8c2a96
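The reason an FP16 round trip can break an exact state comparison can be shown with the standard library alone. This sketch assumes nothing about the Adam tests themselves: it uses `struct`'s half-precision format, and a Python float (64-bit) stands in for the FP32 state, since the information is lost in the FP16 step either way.

```python
import struct


def fp16_round_trip(value: float) -> float:
    """Cast a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


original = 0.1
restored = fp16_round_trip(original)

# 0.1 is not exactly representable in FP16, so the round trip changes it.
print(original == restored)  # False
print(abs(original - restored))

# Values that FP16 can represent exactly, such as 0.5, survive unchanged.
print(fp16_round_trip(0.5) == 0.5)  # True
```

A test that compares optimizer state for exact equality before and after such a cast would therefore be expected to fail, which is consistent with marking those tests xfail.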
  28. 18 Aug, 2020 1 commit
  29. 14 Aug, 2020 1 commit
  30. 31 Jul, 2020 3 commits
  31. 08 Jul, 2020 1 commit