  1. 17 Sep, 2020 1 commit
    • Multi-process pipe (#90) · 63f7796a
      Tom Birch authored
      Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
      * Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess but also supporting PipelineStyle.MultiProcess (see the sketch after this list)
      * Adds support for lazy construction of modules (see lazy_construction for an example)
      * Adds two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
      * Copies all the relevant tests from tests/pipe to tests/pipe_process and modifies them to exercise PipelineStyle.MultiProcess
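      A minimal sketch of the new constructor argument. The names Pipe, style, and PipelineStyle.{SingleProcess,MultiProcess} come from this commit message; the import path for PipelineStyle, the model, and the balance values are assumptions for illustration:

      ```python
      import torch.nn as nn

      from fairscale.nn import Pipe
      # Assumption: the import path for PipelineStyle; only the name itself
      # appears in the commit message above.
      from fairscale.nn.pipe import PipelineStyle

      # A toy two-stage pipeline; the model and balance are illustrative.
      model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 8))

      # Default: the whole pipeline runs in a single process, as before this change.
      single = Pipe(model, balance=[2, 1], chunks=4, style=PipelineStyle.SingleProcess)

      # MultiProcess: each stage runs in its own process (possibly on another
      # machine); each rank constructs the Pipe and executes only its own stage.
      multi = Pipe(model, balance=[2, 1], chunks=4, style=PipelineStyle.MultiProcess)
      ```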
  2. 03 Sep, 2020 1 commit
    • Add grad scaler (#48) · b6a5e634
      Jun Ru Anderson authored
      Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; it is not strictly needed there, but it serves as a good example of how to use gradient scaling for larger models that do require it in order to converge (see the sketch below).
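      A minimal sketch of the standard gradient-scaling loop from PyTorch's torch.cuda.amp API, which this subclass builds on. The model, optimizer, and data here are placeholders, and fairscale's GradScaler is assumed to be a drop-in replacement (the snippet requires a CUDA device):

      ```python
      import torch
      from torch import nn
      from torch.cuda.amp import GradScaler, autocast

      # Placeholder model and optimizer; fairscale's GradScaler subclass is
      # assumed to be usable wherever torch.cuda.amp.GradScaler is.
      model = nn.Linear(32, 1).cuda()
      optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
      scaler = GradScaler()

      for _ in range(10):
          x = torch.randn(8, 32, device="cuda")
          y = torch.randn(8, 1, device="cuda")
          optimizer.zero_grad()
          with autocast():                  # run the forward pass in mixed precision
              loss = nn.functional.mse_loss(model(x), y)
          scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
          scaler.step(optimizer)            # unscales gradients; skips the step on inf/NaN
          scaler.update()                   # adjust the scale factor for the next iteration
      ```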
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
  3. 28 Aug, 2020 1 commit
    • [test] specify chunks for pipe/transformer benchmark (#52) · d1d74413
      Jun Ru Anderson authored
      * specify chunks for pipe/transformer benchmark
      
      Set chunks equal to len(balance) for the pipe/transformer benchmark (see the sketch below). The words-per-second and memory-usage checks will be updated in the next commit (must test on CircleCI to find appropriate values).
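      A sketch of the change, assuming the benchmark builds its pipeline roughly like this; the stand-in model and balance values are illustrative, not taken from the benchmark:

      ```python
      import torch.nn as nn
      from fairscale.nn import Pipe

      # Stand-in for the transformer used by the benchmark.
      model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(8)])

      balance = [2, 2, 2, 2]     # layers per device; values are illustrative
      pipe = Pipe(
          model,
          balance=balance,
          chunks=len(balance),   # one micro-batch per pipeline stage
      )
      ```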
      
      * change benchmark words per second and memory usage
      
      Did six runs for words per second, with results: 9144.40, 9163.91, 9993.01, 9082.82, 9155.09, 9000.67.
      Peak allocated bytes per device (which do not change between runs) were 193206272, 645632, 562688, and 92688384 for devices 0, 1, 2, and 3, respectively.
      
      * increase batch size
      
      The batch size was small enough that the GPU's compute was not the bottleneck, which slowed training and, in particular, made using more chunks slower. Increasing the batch size has therefore increased training speed.
      
      * update benchmark numbers
      
      Ran six times, with words per second of 36917.44, 36797.65, 37006.03, 36872.84, 37129.31, and 37003.31, and peak allocated bytes of 4061909504, 4050944, 10427392, and 2031824896 for devices 0, 1, 2, and 3, respectively.
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>