  1. 02 Apr, 2021 1 commit
  2. 25 Mar, 2021 1 commit
  3. 24 Mar, 2021 1 commit
  4. 18 Mar, 2021 1 commit
  5. 16 Mar, 2021 2 commits
    • 1-bit Adam v2 (#817) · 68c8481b
      Conglong Li authored
      
      
      Authors: @awan-10 @conglongli @samyam @jeffra
      
      What's new:
      
      * NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
      * Support for momentum masks for parameters whose gradients are constantly zero during training.
      * Bug fixes (e.g., #813).
      
      * NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
      
      * NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
      
      * add nccl 1-bit optim.
      
      * temporary commit to save stuff.
      
      * Use dist collectives instead of mpi routines.
      
      * remove old code for comm.
      
      * Fix bugs. still does not work.
      
      * modify to test the nccl side code path
      
      * Initial gather impl. Works intra-node.
      
      * Updates to comm. phase 2. nccl comm. passed the tests.
      
      * refactor code to introduce nccl/mpi as backends for onebit adam.
      
      * Refactor updates to test/engine.
      
      * Fix compile/runtime errors.
      
      * simplify support for nccl/mpi backends.
      
      * Add missing file
      
      * Add compression backend in constructor. Revert later.
      
      * modify test with some perf counting.
      
      * Implement a true non-blocking gather for nccl side.
      
      * Revert "Add compression backend in constructor. Revert later."
      
      This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.
      
      * improve the 1-bit adam test.
      
      * Refactor comm. and compression backend in 1-bit adam.
      
      * Fix the test.
      
      * Fix runtime errors and typos in nccl backend
      
      * fix mpi backend. modify tests.
      
      * modify nccl perf test.
      
      * fix mpi side errors.
      
      * Add an mpi perf test
      
      * Sync DSE.
      
      * Remove old collectives file.
      
      * Undo a typo.
      
      * Graceful failure for torch versions that don't support nccl pt2pt.
      
      * Revert "Merge branch 'master' into staging-1bit-nccl-v2"
      
      This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
      changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.
      
      * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
      
      This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.
      
      * comm optimization + 1-bit lamb
      
      * Saving/debugging commit.
      
      * finalizing 1-bit lamb
      
      * finalizing 1-bit lamb
      
      * add momentum mask and chkpt handling for 1-bit adam
      
      * Cleanup and modify nccl test to be runnable with deepspeed launcher.
      
      * Fix format.
      
      * fix formatting again.
      
      * make test runnable without mpi4py
      
      * Add dist.alltoall and dist.allgather instead of custom functions.
      
      * remove debug prints.
      
      * formatting and renaming
      
      * renaming
      
      * renaming
      
      * add unit test, fix existing tests
      
      * skip unit test when torch < 1.8
      
      * revert 1-bit lamb
      
      * flatten momentum when dimension is more than 1
      
      * add warning message for 1-bit adam under fp32
      
      * improve version check
      
      * add fp32 test
      
      * 1-bit adam doc
      
      * fix file name
      
      * doc fix
      
      * torch 1.8 is released
      
      * doc fix
      
      * fix tests
      
      * update news
      
      * add doc for momentum mask
      
      * fix checkpoint handling, add unit test
      
      * checkpoint handling doc
      
      * doc final cleanup
      
      * bump dates
      
      * update tests
      
      * url change
      
      * doc fix
      
      * fix test
      
      * doc update
      Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • docs: minor spelling tweaks (#858) · 547d1c5f
      brett koonce authored
  6. 14 Mar, 2021 1 commit
  7. 11 Mar, 2021 2 commits
  8. 08 Mar, 2021 3 commits
  9. 26 Feb, 2021 1 commit
  10. 16 Feb, 2021 1 commit
    • Minor doc tweaks (#761) · c28a71f9
      Olatunji Ruwase authored
      * Fix docstring
      
      * Make screenshots clickable for easier viewing
      
      * Navigation menu in alphabetical order; more clickable screenshots
      
      * Rename 1Cycle doc
      
      * Tweak naming
  11. 12 Feb, 2021 1 commit
  12. 11 Feb, 2021 2 commits
    • 1-bit Adam documentation fix (#747) · 248f6383
      Conglong Li authored
      
      
      * 1-bit adam doc fix
      
      * 1-bit adam doc fix
      
      * 1-bit adam doc fix
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • Add flops profiler tutorial (#682) · e2dfe0d1
      Cheng Li authored
      * work on flops profiler tutorial
      
      * update flops profiler tutorial
      
      * add flops profiler tutorial and fix names
      
      * work on flops profiler tutorial
      
      * update flops profiler tutorial
      
      * add flops profiler tutorial and fix names
      
      * fix trailing whitespace
      
      * fix names
      
      * remove multistep profiling and update docs
      
      * fix cases where functionals and submodules coexist in a parent module, update readme
      
      * fix typo
      
      * always invoke post hook function
      
      * fix module flops sum and update tests
      
      * update tutorial
  13. 10 Feb, 2021 2 commits
  14. 20 Jan, 2021 1 commit
  15. 08 Jan, 2021 1 commit
  16. 06 Jan, 2021 1 commit
  17. 05 Jan, 2021 1 commit
  18. 18 Dec, 2020 1 commit
  19. 07 Dec, 2020 1 commit
  20. 02 Dec, 2020 1 commit
  21. 28 Nov, 2020 1 commit
    • [doc] typo fix and clarification (#563) · 17f36f1b
      Stas Bekman authored
      This PR:
      * fixes a misspelled method name
      * clarifies `( () )`, which reads like a formatting bug until one reads the code and sees that it is intentional; the proposed wording simply says that it's a callable object.
  22. 12 Nov, 2020 2 commits
  23. 11 Nov, 2020 1 commit
  24. 10 Nov, 2020 2 commits
  25. 09 Nov, 2020 1 commit
  26. 12 Oct, 2020 1 commit
  27. 07 Oct, 2020 1 commit
  28. 25 Sep, 2020 2 commits
  29. 24 Sep, 2020 2 commits
  30. 17 Sep, 2020 1 commit