- 18 Mar, 2021 1 commit
-
-
Stas Bekman authored
* consistent checkpoint filenaming
* backward compatible rename
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 16 Mar, 2021 7 commits
-
-
Conglong Li authored
Authors: @awan-10 @conglongli @samyam @jeffra
What's new:
NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
Add support for momentum masks for those parameters with constant zero gradients during training.
Bug fixes (e.g., #813).
* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
* add nccl 1-bit optim.
* temporary commit to save stuff.
* Use dist collectives instead of mpi routines.
* remove old code for comm.
* Fix bugs. still does not work.
* modify to test the nccl side code path
* Initial gather impl. Works intra-node.
* Updates to comm. phase 2. nccl comm. passed the tests.
* refactor code to introduce nccl/mpi as backends for onebit adam.
* Refactor updates to test/engine.
* Fix compile/runtime errors.
* simplify support for nccl/mpi backends.
* Add missing file
*...
-
Jeff Rasley authored
-
Olatunji Ruwase authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Olatunji Ruwase authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
-
brett koonce authored
-
Stas Bekman authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 15 Mar, 2021 2 commits
-
-
Olatunji Ruwase authored
* Ensure gradients of other partitions are cleared after reduction
* Remove redundant code
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Samyam Rajbhandari authored
* Fix mis-aligned-grad
  When a parameter's size is not divisible by the world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR should fix that.
* Formatting fix
* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
* also removing alignment from flat fp16 buffers
* Testing for hidden dim alignment
* inference hook fix
* Update stage3.py
* formatting
* [bug-fix] move params to gpu if offload params is turned off
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
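The padding problem described in this commit message can be illustrated with a minimal sketch. It is not the code from stage3.py; the function name `partition_with_padding` and the plain-list representation are assumptions made for illustration. The idea is only that a flat buffer whose length is not a multiple of the world size has to be zero-padded at the tail before it is split, so that every rank owns a shard of the same length.

```python
def partition_with_padding(flat, world_size):
    """Split a flat list into world_size equal shards, zero-padding the tail.

    Hypothetical helper for illustration only; not the DeepSpeed implementation.
    """
    remainder = len(flat) % world_size
    # Number of zeros needed to reach the next multiple of world_size (0 if aligned).
    padding = (world_size - remainder) % world_size
    padded = list(flat) + [0.0] * padding
    shard_len = len(padded) // world_size
    shards = [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]
    return shards, padding


# 10 gradient elements over 4 ranks: pad by 2 so that each rank owns 3 elements.
grads = [float(i) for i in range(10)]
shards, padding = partition_with_padding(grads, world_size=4)

# Dropping the padding from the concatenated shards recovers the original buffer.
flat_again = [v for shard in shards for v in shard]
recovered = flat_again[:len(flat_again) - padding]
```

If the padding were applied to the wrong shard, or counted differently on the gradient side than on the parameter side, element i of a gradient shard would no longer correspond to element i of the parameter shard, which is the kind of mis-alignment the message describes.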
-
- 14 Mar, 2021 1 commit
-
-
Stas Bekman authored
Admin merging for pure-doc PR that does not trigger build.
-
- 12 Mar, 2021 3 commits
-
-
Cheng Li authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
-
Stas Bekman authored
* fix log(0) & 1/log(1) bugs
* simplify
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
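The commit message does not say where the log(0) and 1/log(1) expressions occurred, so the following is only a generic sketch of how such edge cases are usually guarded. The function `log_ramp_factor` and its parameters are hypothetical and are not taken from the repository. It computes a logarithmic ramp from 0 to 1 and avoids both `math.log(0)` (which raises `ValueError`) and division by `math.log(1)` (which is zero).

```python
import math


def log_ramp_factor(step, ramp_steps):
    """Return a factor in [0, 1] that grows logarithmically with step.

    Hypothetical example, not the patched DeepSpeed code.
    """
    # With ramp_steps <= 1, log(ramp_steps) is 0 (or undefined), so 1/log(...) would fail.
    if ramp_steps <= 1:
        return 1.0
    # Clamp so that log() is never called with an argument below 1.
    step = max(step, 0)
    if step >= ramp_steps:
        return 1.0
    return math.log(step + 1) / math.log(ramp_steps)


start = log_ramp_factor(0, 10)      # log(1) / log(10)
degenerate = log_ramp_factor(0, 1)  # guarded: no division by log(1)
finished = log_ramp_factor(10, 10)  # past the ramp
```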
-
- 11 Mar, 2021 5 commits
-
-
Olatunji Ruwase authored
* Control ZeRO wall clock timers
* Disable more ZeRO3 debug prints
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Stas Bekman authored
-
Cheng Li authored
* add optimizers and schedules to rtd
* update ds website and fix links
* add optimizers and schedules to rtd
* update ds website and fix links
* add flops profiler to rtd
* fix
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
-
Stas Bekman authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
-
- 10 Mar, 2021 1 commit
-
-
Shaden Smith authored
-
- 09 Mar, 2021 2 commits
-
-
Jeff Rasley authored
-
Jeff Rasley authored
-
- 08 Mar, 2021 7 commits
-
-
Samyam Rajbhandari authored
-
Jeff Rasley authored
-
Jeff Rasley authored
-
Jeff Rasley authored
-
Jeff Rasley authored
-
Samyam Rajbhandari authored
* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
-
Olatunji Ruwase authored
-
- 03 Mar, 2021 1 commit
-
-
Reza Yazdani authored
* fixing buffers in transformer kernel when gelu-checkpoint is enabled
* fixing the test issue for other memory optimization flags
* fixing a bug for when attn_dropout_checkpoint is enabled
-
- 28 Feb, 2021 1 commit
-
-
zmx authored
Hi, I took a look at the code of column_sum_reduce and have two questions:
1. The goal of column_sum_reduce is to compute the column sums of the inp matrix with shape [rows, width], so the result should have shape [width], right? The condition that checks pos does not seem suitable for that.
2. The CUDA kernel is implemented on the assumption that threads with the same threadIdx.y are grouped into a thread_block_tile, with blockDim = (32, 32). According to the NVIDIA presentation https://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf , a thread block tile is a subset of the threads of a thread block, divided into tiles in row-major order. Doesn't that mean that threads with the same threadIdx.x are grouped into a thread_block_tile?
Thanks!
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
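The two points raised in this message can be explored with a small CPU-side sketch in plain Python. It is not the CUDA kernel: `column_sum`, `thread_rank`, and `tile_index` are hypothetical helper names. It assumes the usual linearization of a 2D thread block, in which threadIdx.x varies fastest (rank = threadIdx.y * blockDim.x + threadIdx.x), and assumes that a tile of size 32 is formed from 32 consecutive ranks.

```python
TILE_SIZE = 32  # tile width assumed to match the warp size


def column_sum(inp):
    """Reference result: reduce a [rows, width] matrix to a [width] vector."""
    width = len(inp[0])
    return [sum(row[c] for row in inp) for c in range(width)]


def thread_rank(x, y, block_dim_x):
    """Linear rank of a thread in a 2D block, with threadIdx.x varying fastest."""
    return y * block_dim_x + x


def tile_index(x, y, block_dim_x, tile_size=TILE_SIZE):
    """Index of the tile a thread lands in when tiles take consecutive ranks."""
    return thread_rank(x, y, block_dim_x) // tile_size


# For blockDim = (32, 32), list the (x, y) members of tile 5.
members_of_tile_5 = [
    (x, y) for y in range(32) for x in range(32) if tile_index(x, y, 32) == 5
]

sums = column_sum([[1, 2, 3], [4, 5, 6]])
```

Under these assumptions, every member of tile 5 has y == 5 while x runs over 0..31, so with blockDim = (32, 32) a tile collects the threads that share threadIdx.y. With a different blockDim.x the grouping changes, which is worth keeping in mind when reading the kernel.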
-
- 27 Feb, 2021 1 commit
-
-
vfdev authored
-
- 26 Feb, 2021 3 commits
-
-
Stas Bekman authored
-
Reza Yazdani authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
vfdev authored
-
- 24 Feb, 2021 2 commits
-
-
Reza Yazdani authored
* fix the bias-add precision and indexing and also adding the layer-norm-eps as a configurable parameter for transformer
* add ACC_HALF config
* use defined to check if ACC_Half is defined
-
Reza Yazdani authored
-
- 20 Feb, 2021 1 commit
-
-
Stas Bekman authored
Invalid param name Thanks.
-
- 19 Feb, 2021 1 commit
-
-
Jeff Rasley authored
-
- 18 Feb, 2021 1 commit
-
-
Reza Yazdani authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-