1. 04 Feb, 2021 1 commit
  2. 03 Feb, 2021 2 commits
  3. 02 Feb, 2021 1 commit
  4. 30 Jan, 2021 1 commit
  5. 29 Jan, 2021 1 commit
  6. 27 Jan, 2021 1 commit
  7. 23 Jan, 2021 1 commit
  8. 21 Jan, 2021 3 commits
  9. 15 Jan, 2021 1 commit
  10. 11 Jan, 2021 1 commit
  11. 05 Jan, 2021 1 commit
    • [fix] Flaky tests (#283) · 79365ee6
      Benjamin Lefaudeux authored
      * adding the pytest timeout plugin to root out hanging tests
      * removing redundant code, using a slightly more reasonable timeout, works on a single CUDA device
      * finding the root bug behind some of the CPU hangs: RPC init
      * propagating all the RPC init test changes to the pipe and model parallel tests
  12. 02 Jan, 2021 1 commit
  13. 30 Dec, 2020 1 commit
  14. 29 Dec, 2020 1 commit
  15. 28 Dec, 2020 1 commit
  16. 19 Dec, 2020 1 commit
  17. 10 Dec, 2020 1 commit
  18. 04 Dec, 2020 1 commit
  19. 01 Dec, 2020 2 commits
  20. 21 Nov, 2020 1 commit
    • [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speed is likely still left on the table, since the speedup from bucketing does not match expectations; this could be a follow-up
  21. 18 Nov, 2020 1 commit
  22. 11 Nov, 2020 2 commits
  23. 10 Nov, 2020 1 commit
    • Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker that processes either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
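The per-worker event loop described for PipelineStyle.AsyncSchedule can be pictured with a minimal, standard-library-only sketch. This is not the fairscale implementation: the `(kind, index)` message format, the `None` sentinel, and the `run_event_loop` name are all assumptions made for illustration. It only shows the idea of one loop handling activations and gradients in whatever order they arrive.

```python
import queue
from typing import List, Optional, Tuple

# Hypothetical message format: (kind, micro-batch index).
Message = Tuple[str, int]


def run_event_loop(inbox: "queue.Queue[Optional[Message]]") -> List[str]:
    """Process messages in arrival order until a None sentinel is received."""
    log: List[str] = []
    while True:
        message = inbox.get()
        if message is None:  # assumed shutdown sentinel
            break
        kind, index = message
        if kind == "activation":
            # An activation from the previous stage triggers a forward step.
            log.append(f"forward {index}")
        elif kind == "gradient":
            # A gradient from the next stage triggers a backward step.
            log.append(f"backward {index}")
        else:
            raise ValueError(f"unknown message kind: {kind}")
    return log
```

In this sketch the worker never waits for all forward passes to finish before starting backward passes; it reacts to whichever message is next in its inbox.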
  24. 30 Oct, 2020 1 commit
  25. 29 Oct, 2020 1 commit
  26. 23 Oct, 2020 1 commit
  27. 21 Oct, 2020 1 commit
  28. 20 Oct, 2020 1 commit
  29. 17 Oct, 2020 1 commit
  30. 16 Oct, 2020 2 commits
  31. 14 Oct, 2020 1 commit
  32. 08 Oct, 2020 2 commits
    • [feat] moe: initial implementation of MOELayer (#128) · 22ff665d
      msbaines authored
      Currently only implemented for a single process and a single expert.
    • [test] Add unittest for checkpoint & DDP (#126) · 6658be22
      Min Xu authored
      * Add unittest for checkpoint & DDP
      
      - this change adds test cases that reproduce the error with checkpoint & DDP
      - mandeep mentioned that there is also a deadlock in this case, but this
        change doesn't cover that
      - cases where weight sharing is OK are covered
      - however, checkpointing the same module multiple times and using
        find_unused_parameters are both not OK
      
      * added norm checks
  33. 06 Oct, 2020 1 commit
    • [feat] OSS/SDP : bucketing (#122) · 341d8b2b
      Benjamin Lefaudeux authored
      Same bucketing strategy for OSS and SDP:
      sort everything ahead of time, per rank and per size, smaller tensors first. Bucket the smallest elements in a fixed buffer and send it asynchronously, then send all the other tensors asynchronously, and come back to the bucket. Once the sends are done, scatter the bucket contents if needed.
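The bucketing strategy above can be illustrated with a small pure-Python sketch. It is an assumption-laden illustration, not the OSS/SDP code: flat lists of floats stand in for tensors, the asynchronous sends are omitted entirely, and the names `plan_buckets`, `pack`, and `unpack` are hypothetical. It shows only the sorting, the fixed-capacity bucket, and the final scatter of bucket contents back to per-parameter values.

```python
from typing import Dict, List, Tuple

# (name, flat values); a stand-in for a named tensor.
Param = Tuple[str, List[float]]
# (name, start offset in the buffer, number of elements).
Slot = Tuple[str, int, int]


def plan_buckets(
    params_per_rank: Dict[int, List[Param]], bucket_capacity: int
) -> Dict[int, Tuple[List[Param], List[Param]]]:
    """For each rank, split params into (bucketed, sent_individually)."""
    plan = {}
    for rank, params in params_per_rank.items():
        # Sort ahead of time, smaller tensors first.
        ordered = sorted(params, key=lambda p: len(p[1]))
        bucketed: List[Param] = []
        direct: List[Param] = []
        used = 0
        for name, values in ordered:
            if used + len(values) <= bucket_capacity:
                bucketed.append((name, values))
                used += len(values)
            else:
                # Too large for the remaining buffer space: send on its own.
                direct.append((name, values))
        plan[rank] = (bucketed, direct)
    return plan


def pack(bucketed: List[Param]) -> Tuple[List[float], List[Slot]]:
    """Copy the bucketed params into one flat buffer and record the layout."""
    buffer: List[float] = []
    layout: List[Slot] = []
    for name, values in bucketed:
        layout.append((name, len(buffer), len(values)))
        buffer.extend(values)
    return buffer, layout


def unpack(buffer: List[float], layout: List[Slot]) -> Dict[str, List[float]]:
    """Scatter the buffer contents back to per-parameter values."""
    return {name: buffer[start:start + length] for name, start, length in layout}
```

Because the parameters are sorted by size, the bucketed set is always a prefix of the sorted list: once one parameter no longer fits, no later (larger or equal) one can fit either.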