"vscode:/vscode.git/clone" did not exist on "55c695bb69f42537c8e643d1fa9dd75c30521e33"
- 31 Mar, 2021 2 commits
-
-
Jeff Rasley authored
-
Jeff Rasley authored
-
- 16 Mar, 2021 1 commit
-
-
Conglong Li authored
Authors: @awan-10 @conglongli @samyam @jeffra
What's new:
NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
Add support to momentum masks for those parameters with constant zero gradients during training.
Bug fixes (e.g., #813).
* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
* add nccl 1-bit optim.
* temporary commit to save stuff.
* Use dist collectives instead of mpi routines.
* remove old code for comm.
* Fix bugs. still does not work.
* modify to test the nccl side code path
* Initial gather impl. Works intra-node.
* Updates to comm. phase 2. nccl comm. passed the tests.
* refactor code to introduce nccl/mpi as backends for onebit adam.
* Refactor updates to test/engine.
* Fix compile/runtime errors.
* simplify support for nccl/mpi backends.
* Add missign file
* Add compression backend in constructor. Revert later.
* modify test with some perf counting.
* Implement a true non-blocking gather for nccl side.
* Revert "Add compression backend in constructor. Revert later." This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.
* improve the 1-bit adam test.
* Refactor comm. and compression backend in 1-bit adam.
* Fix the test.
* Fix runtime errors and typos in nccl backend
* fix mpi backend. modify tests.
* modify nccl perf test.
* fix mpi side errors.
* Add an mpi perf test
* Sync DSE.
* Remove old collectives file.
* Undo a typo.
* Graceful failure for torch versions that don't support nccl pt2pt.
* Revert "Merge branch 'master' into staging-1bit-nccl-v2" This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.
* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2"" This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.
* comm optimization + 1-bit lamb
* Saving/debugging commit.
* finalizing 1-bit lamb
* finalizing 1-bit lamb
* add momentum mask and chkpt handling for 1-bit adam
* Cleanup and modify nccl test to be runnable with deepspeed launcher.
* Fix format.
* fix formatting again.
* make test runnable without mpi4py
* Add dist.alltoall and dist.allgather instead of custom functions.
* remove debug prints.
* formatting and renaming
* renaming
* renaming
* add unit test, fix existing tests
* skip unit test when torch < 1.8
* revert 1-bit lamb
* flatten momentum when dimension is more than 1
* add warning message for 1-bit adam under fp32
* improve version check
* add fp32 test
* 1-bit adam doc
* fix file name
* doc fix
* torch 1.8 is released
* doc fix
* fix tests
* update news
* add doc for momentum mask
* fix checkpoing handling, add unit test
* checkpoint handling doc
* doc final cleanup
* bump dates
* update tests
* url change
* doc fix
* fix test
* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 08 Mar, 2021 1 commit
-
-
Jeff Rasley authored
-
- 11 Feb, 2021 1 commit
-
-
Conglong Li authored
* 1-bit adam doc fix
* 1-bit adam doc fix
* 1-bit adam doc fix
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 19 Jan, 2021 1 commit
-
-
Jeff Rasley authored
* Update README.md
* Update index.md
-
- 09 Dec, 2020 1 commit
-
-
Jeff Rasley authored
-
- 12 Nov, 2020 1 commit
-
-
Jeff Rasley authored
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
-
- 10 Nov, 2020 1 commit
-
-
Olatunji Ruwase authored
* Progressive layer dropping docs (#499)
* test
* Adding tutorial and news page for pld
* updating the tutorial and posts of PLD
* update the finetune tutorial
* Update PLD tutorial (#512)
* Update installation instructions
* Format fix
* ZeRO tutorial
* Format fixes
* ZeRO-Offload
* ZeRO and ZeRO-Offload tutorials
* Update navigation page
* Format fixes
* Add yuxhe feedback
* Fix blog post link
* Fix OneBit-Adam link Tweak scheduler example
* Fix date link
* Add DeepSpeed_Adam
* Add PLD tutorial to navigation
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* updating the pld docs
* DeepSpeed implementation of PLD (#508)
* DeepSpeed implementation of PLD
* Format fixes
* Formatting fixes
* Fix broken url
* Address PR feedback
* Bump DSE
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
-
- 09 Nov, 2020 2 commits
-
-
Olatunji Ruwase authored
* PLD documentation
* Formatting fixes
* Fix url bug
-
Olatunji Ruwase authored
* PLD documentation
* Formatting fixes
-
- 19 Oct, 2020 1 commit
-
-
Shaden Smith authored
-
- 10 Sep, 2020 4 commits
-
-
Shaden Smith authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
-
Minjia Zhang authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
* ZeRO-Offload (squash) (#381)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
-
- 25 Jul, 2020 1 commit
-
-
Shaden Smith authored
-
- 20 Jun, 2020 1 commit
-
-
Shaden Smith authored
-
- 04 Jun, 2020 1 commit
-
-
Shaden Smith authored
* links and formatting
-
- 19 May, 2020 1 commit
-
-
Jeff Rasley authored
Updates for ZeRO stage 2 + ZeRO stage 1 w. RS
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
-
- 18 Mar, 2020 2 commits
-
-
Shaden Smith authored
-
Shaden Smith authored
* Add coming soon to posts
* Add what's new section to main page
-
- 17 Mar, 2020 1 commit
-
-
Shaden Smith authored
-
- 03 Mar, 2020 1 commit
-
-
Jeff Rasley authored
* add support for deepspeed env file to pass custom env values
* simplify deepspeed config example
-
- 27 Feb, 2020 1 commit
-
-
Jeff Rasley authored
* add text about mpirun
-
- 24 Feb, 2020 1 commit
-
-
Shaden Smith authored
-
- 13 Feb, 2020 1 commit
-
-
Rahul Prasad authored
-
- 11 Feb, 2020 1 commit
-
-
Gaurav Menghani authored
* Fix broken link for the 1Cycle doc.
* Removed the 1Cycle link from README.md.
-
- 10 Feb, 2020 9 commits
-
-
Shaden Smith authored
-
Shaden Smith authored
-
Shaden Smith authored
* Importing 1Cycle tutorial.
* image paths
* Added LR schedule figure
* line wrap
* lowercase name
* Updating README links
* typo
-
sheikheddy authored
-
Jeff Rasley authored
-
Shaden Smith authored
-
Jeff Rasley authored
-
Shaden Smith authored
* Increasing section headers
* Move testing under contributing
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Shaden Smith authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 09 Feb, 2020 3 commits
-
-
Shaden Smith authored
-
Jeff Rasley authored
-
Jeff Rasley authored
-