- 08 Mar, 2021 1 commit
-
-
Samyam Rajbhandari authored
* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
-
- 29 Jan, 2021 1 commit
-
-
Jeff Rasley authored
-
- 15 Jan, 2021 1 commit
-
-
Olatunji Ruwase authored
-
- 04 Jan, 2021 1 commit
-
-
Olatunji Ruwase authored
-
- 12 Nov, 2020 1 commit
-
-
Jeff Rasley authored
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
-
- 15 Sep, 2020 1 commit
-
-
Jeff Rasley authored
* add pytest skips around tests that require certain ops to be installed
-
- 10 Sep, 2020 2 commits
-
-
Jeff Rasley authored
-
Jeff Rasley authored
* ZeRO-Offload (squash) (#381)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
-
- 15 Jul, 2020 1 commit
-
-
Jeff Rasley authored
* empty grad fix
* add unit tests for empty grad
-
- 11 Jul, 2020 1 commit
-
-
Jeff Rasley authored
* add amp support for deepspeed (non-ZeRO)
* tests for amp mode
-
- 06 Jul, 2020 1 commit
-
-
Olatunji Ruwase authored
* Load non-DeepSpeed checkpoints into ZeRO optimizer
* Handle parameters smaller than DP
* Formatting fixes
* Handle empty partitions
* Fix perf bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 23 Jun, 2020 1 commit
-
-
Olatunji Ruwase authored
* Load non-DeepSpeed checkpoints into ZeRO optimizer
* Handle parameters smaller than DP
* Formatting fixes
-
- 27 May, 2020 1 commit
-
-
Jeff Rasley authored
* updates to support fp32 grad clipping and disable max_grad_norm
-
- 19 May, 2020 1 commit
-
-
Jeff Rasley authored
Updates for ZeRO stage 2 + ZeRO stage 1 w. RS
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
-
- 24 Apr, 2020 1 commit
-
-
Olatunji Ruwase authored
-
- 27 Mar, 2020 1 commit
-
-
Calogero Zarbo authored
* added zero_allow_untested_optimizer flag helpers
* add zero_allow_untested_optimizer config constants
* zero_allow_untested_optimizer logic with assertion
* Added unit test and CustomOptimizer helper class
-
- 25 Mar, 2020 1 commit
-
-
Shaden Smith authored
-
- 10 Mar, 2020 1 commit
-
-
Olatunji Ruwase authored
* add test cases for onecycle policy with fp16/zero
* Make lr schedulers support fp16 optimizers
* Fix formatting
* More specific naming
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 26 Feb, 2020 1 commit
-
-
Jeff Rasley authored
* add auto-detect to torch dist init
* update tests to infer distributed init status
* prevent crash if dist_init_required is True but already initialized
* only init if safe to do so (forgot to add this file in prev commit)
-
- 20 Feb, 2020 1 commit
-
-
Jeff Rasley authored
Also a fix for #94
-
- 15 Feb, 2020 1 commit
-
-
Jeff Rasley authored
bug fixes for adamw/lamb and corresponding tests
-