1. 26 May, 2022 1 commit
  2. 25 May, 2022 1 commit
  3. 02 Apr, 2021 2 commits
  4. 01 Apr, 2021 1 commit
  5. 31 Mar, 2021 3 commits
  6. 30 Mar, 2021 3 commits
  7. 27 Mar, 2021 2 commits
  8. 26 Mar, 2021 1 commit
  9. 25 Mar, 2021 1 commit
  10. 24 Mar, 2021 1 commit
  11. 18 Mar, 2021 2 commits
  12. 16 Mar, 2021 7 commits
    • 1-bit Adam v2 (#817) · 68c8481b
      Conglong Li authored
      
      
      Authors: @awan-10 @conglongli @samyam @jeffra
      
      What's new:
      
      NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
      Add support for momentum masks for parameters with constant zero gradients during training.
      Bug fixes (e.g., #813).
      
      * NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
      
      * NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
      
      * add nccl 1-bit optim.
      
      * temporary commit to save stuff.
      
      * Use dist collectives instead of mpi routines.
      
      * remove old code for comm.
      
      * Fix bugs. still does not work.
      
      * modify to test the nccl side code path
      
      * Initial gather impl. Works intra-node.
      
      * Updates to comm. phase 2. nccl comm. passed the tests.
      
      * refactor code to introduce nccl/mpi as backends for onebit adam.
      
      * Refactor updates to test/engine.
      
      * Fix compile/runtime errors.
      
      * simplify support for nccl/mpi backends.
      
      * Add missing file
      
      * Add compression backend in constructor. Revert later.
      
      * modify test with some perf counting.
      
      * Implement a true non-blocking gather for nccl side.
      
      * Revert "Add compression backend in constructor. Revert later."
      
      This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.
      
      * improve the 1-bit adam test.
      
      * Refactor comm. and compression backend in 1-bit adam.
      
      * Fix the test.
      
      * Fix runtime errors and typos in nccl backend
      
      * fix mpi backend. modify tests.
      
      * modify nccl perf test.
      
      * fix mpi side errors.
      
      * Add an mpi perf test
      
      * Sync DSE.
      
      * Remove old collectives file.
      
      * Undo a typo.
      
      * Graceful failure for torch versions that don't support nccl pt2pt.
      
      * Revert "Merge branch 'master' into staging-1bit-nccl-v2"
      
      This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
      changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.
      
      * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
      
      This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.
      
      * comm optimization + 1-bit lamb
      
      * Saving/debugging commit.
      
      * finalizing 1-bit lamb
      
      * finalizing 1-bit lamb
      
      * add momentum mask and chkpt handling for 1-bit adam
      
      * Cleanup and modify nccl test to be runnable with deepspeed launcher.
      
      * Fix format.
      
      * fix formatting again.
      
      * make test runnable without mpi4py
      
      * Add dist.alltoall and dist.allgather instead of custom functions.
      
      * remove debug prints.
      
      * formatting and renaming
      
      * renaming
      
      * renaming
      
      * add unit test, fix existing tests
      
      * skip unit test when torch < 1.8
      
      * revert 1-bit lamb
      
      * flatten momentum when dimension is more than 1
      
      * add warning message for 1-bit adam under fp32
      
      * improve version check
      
      * add fp32 test
      
      * 1-bit adam doc
      
      * fix file name
      
      * doc fix
      
      * torch 1.8 is released
      
      * doc fix
      
      * fix tests
      
      * update news
      
      * add doc for momentum mask
      
      * fix checkpoint handling, add unit test
      
      * checkpoint handling doc
      
      * doc final cleanup
      
      * bump dates
      
      * update tests
      
      * url change
      
      * doc fix
      
      * fix test
      
      * doc update
      Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • bump version 0.3.13 · 12a53b43
      Jeff Rasley authored
    • Olatunji Ruwase's avatar
      7bcd72a2
    • Fix ZeRO3 save_checkpoint (#857) · fa87a73a
      Olatunji Ruwase authored
      
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • Jeff Rasley's avatar
      871f3048
    • docs: minor spelling tweaks (#858) · 547d1c5f
      brett koonce authored
    • Stas Bekman's avatar
      24335d49
  13. 15 Mar, 2021 2 commits
  14. 14 Mar, 2021 1 commit
  15. 12 Mar, 2021 3 commits
  16. 11 Mar, 2021 5 commits
  17. 10 Mar, 2021 1 commit
  18. 09 Mar, 2021 2 commits
  19. 08 Mar, 2021 1 commit