- 30 May, 2023 2 commits
- 29 May, 2023 2 commits
- 11 May, 2023 3 commits
- 27 Apr, 2023 2 commits
- 26 Apr, 2023 3 commits
- 30 Mar, 2023 1 commit
-
-
aiss authored
-
- 10 Aug, 2022 1 commit
-
-
aiss authored
-
- 14 Jun, 2022 1 commit
-
-
aiss authored
-
- 11 Jun, 2022 4 commits
-
-
aiss authored
Merge branch 'deepspeed-0.6.3-rocm' of http://10.0.100.3/dcutoolkit/deeplearing/deepspeed into deepspeed-0.6.3-rocm version modify
-
aiss authored
-
aiss authored
-
aiss authored
-
- 26 May, 2022 1 commit
-
-
aiss authored
-
- 25 May, 2022 1 commit
-
-
aiss authored
-
- 02 Apr, 2021 2 commits
-
-
Ammar Ahmad Awan authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
This test has been giving us trouble for a while, with nondeterministic failures. Skipping it for now so it does not break our CI. Need to revisit soon though.
-
- 01 Apr, 2021 1 commit
-
-
Stas Bekman authored
* zero.Init() clarification
  Clarify that if `model.half()` can't fit into GPU memory, `zero.Init()` is a must. This proposal is via @samyam's clarification shared elsewhere. Thank you.
* style
* add clarity
* style
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 31 Mar, 2021 3 commits
-
-
Jeff Rasley authored
-
Jeff Rasley authored
-
Jeff Rasley authored
-
- 30 Mar, 2021 3 commits
-
-
dependabot[bot] authored
Bumps [kramdown](https://github.com/gettalong/kramdown) from 2.3.0 to 2.3.1.
  - [Release notes](https://github.com/gettalong/kramdown/releases)
  - [Changelog](https://github.com/gettalong/kramdown/blob/master/doc/news.page)
  - [Commits](https://github.com/gettalong/kramdown/commits )
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Jeff Rasley authored
-
Jeff Rasley authored
security alert related to older kramdown version
-
- 27 Mar, 2021 2 commits
-
-
hamlet authored
* Fix zero stage2 cpu_offload when some trainable model parameters are skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707
  When some trainable model parameters are skipped in training, their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run, so they have no norm_for_param_grads.
* Trim space
* Trim space
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Stas Bekman authored
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 26 Mar, 2021 1 commit
-
-
Stas Bekman authored
-
- 25 Mar, 2021 1 commit
-
-
Stas Bekman authored
* see_memory_usage fixes
* didn't expect pt-1.2
* fix the order of things
* fix the order of things
-
- 24 Mar, 2021 1 commit
-
-
Stas Bekman authored
* [doc] pipeline
  As @g-karthik flagged in https://github.com/microsoft/DeepSpeed/pull/659#discussion_r600132598 my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. Thank you!
* tweak
-
- 18 Mar, 2021 2 commits
-
-
Stas Bekman authored
As discussed in https://github.com/microsoft/DeepSpeed/issues/662 this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` command-line argument in the correct place in the invocation script
Fixes: https://github.com/microsoft/DeepSpeed/issues/662
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Stas Bekman authored
* consistent checkpoint filenaming
* backward compatible rename
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 16 Mar, 2021 3 commits
-
-
Conglong Li authored
Authors: @awan-10 @conglongli @samyam @jeffra
What's new:
  NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
  Add support for momentum masks for those parameters with constant zero gradients during training.
  Bug fixes (e.g., #813).
* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
* add nccl 1-bit optim.
* temporary commit to save stuff.
* Use dist collectives instead of mpi routines.
* remove old code for comm.
* Fix bugs. still does not work.
* modify to test the nccl side code path
* Initial gather impl. Works intra-node.
* Updates to comm. phase 2. nccl comm. passed the tests.
* refactor code to introduce nccl/mpi as backends for onebit adam.
* Refactor updates to test/engine.
* Fix compile/runtime errors.
* simplify support for nccl/mpi backends.
* Add missing file
*...
-
Jeff Rasley authored
-
Olatunji Ruwase authored
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-