Commits · c25a91b60c5192065dfdcabd373b947aa2234fe1 · OpenDAS / deepspeed

30 May, 2023 2 commits
- Merge branch 'ds-v0.9.2-rocm' into 'main' · c25a91b6
  aiss authored May 30, 2023
```
Ds v0.9.2 rocm

See merge request dcutoolkit/deeplearing/deepspeed!2
```
  c25a91b6
- modify test folder · af82b300
  aiss authored May 30, 2023
  
  af82b300
29 May, 2023 2 commits
- add dtk version · 8cfd4afa
  aiss authored May 29, 2023
  
  8cfd4afa
- update v0.9.2 · 5bcc463d
  aiss authored May 29, 2023
  
  5bcc463d
11 May, 2023 3 commits
- Merge branch 'ds-v0.8.2-rocm' into 'main' · d1596c94
  aiss authored May 11, 2023
```
Ds v0.8.2 rocm

See merge request aicomponent/deepspeed!3
```
  d1596c94
- update readme_hip · ac5fbab4
  aiss authored May 11, 2023
  
  ac5fbab4
- add readme_hip.md · 141ff533
  aiss authored May 11, 2023
  
  141ff533
27 Apr, 2023 2 commits
- Merge branch 'ds-v0.8.2-rocm' into 'main' · 6a707da5
  aiss authored Apr 27, 2023
```
modify error

See merge request aicomponent/deepspeed!2
```
  6a707da5
- modify error · 0f3656b9
  aiss authored Apr 27, 2023
  
  0f3656b9
26 Apr, 2023 3 commits
- Merge branch 'ds-v0.8.2-rocm' into 'main' · 899b52ce
  aiss authored Apr 26, 2023
```
Ds v0.8.2 rocm, support torch1.13 for the hipify change

See merge request aicomponent/deepspeed!1
```
  899b52ce
- delete hip file · 4acf0e01
  aiss authored Apr 26, 2023
  
  4acf0e01
- support torch1.13 and torch1.10 · 7dd68788
  aiss authored Apr 26, 2023
  
  7dd68788
30 Mar, 2023 1 commit
- push dsv0.8.2 version · 67ea635f
  aiss authored Mar 30, 2023
  
  67ea635f
10 Aug, 2022 1 commit
- modify version code · 1b2721ad
  aiss authored Aug 10, 2022
  
  1b2721ad
14 Jun, 2022 1 commit
- Update setup.py · c3e434ae
  aiss authored Jun 14, 2022
  
  c3e434ae
11 Jun, 2022 4 commits
- Merge branch 'deepspeed-0.6.3-rocm' of... · d335bffa
  aiss authored Jun 11, 2022
```
Merge branch 'deepspeed-0.6.3-rocm' of http://10.0.100.3/dcutoolkit/deeplearing/deepspeed into deepspeed-0.6.3-rocm
version modify
```
  d335bffa
- modify whl name · 5da48343
  aiss authored Jun 11, 2022
  
  5da48343
- Update requirements-sparse_attn.txt · 7fa189a6
  aiss authored Jun 11, 2022
  
  7fa189a6
- add dtk version · 9b6449e6
  aiss authored Jun 11, 2022
  
  9b6449e6
26 May, 2022 1 commit
- modify dtk path · d8669b08
  aiss authored May 26, 2022
  
  d8669b08
25 May, 2022 1 commit
- push Deepspeed 0.6.3 rocm version · 7d1a83a9
  aiss authored May 25, 2022
  
  7d1a83a9
02 Apr, 2021 2 commits

- Add link to AML examples. (#916) · ab5534fc
  Ammar Ahmad Awan authored Apr 02, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  ab5534fc

Jeff Rasley authored Apr 02, 2021

This test has been giving us trouble for a while, with nondeterministic failures. Skipping it for now so it does not break our CI. Need to revisit soon, though.

8db4fdf8

01 Apr, 2021 1 commit

zero.Init() clarification (#880) · 5d721e09

Stas Bekman authored Apr 01, 2021



* zero.Init() clarification

clarify that if `model.half()` can't fit into gpu memory `zero.Init()` is a must.

this proposal is via @samyam's clarification shared elsewhere.

Thank you.

* style

* add clarity

* style
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

5d721e09
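
The memory condition in this clarification can be estimated with simple arithmetic: fp16 stores each parameter in 2 bytes, so the weights alone need roughly `2 * num_params` bytes. The sketch below is illustrative only. `fp16_weight_bytes` and `needs_zero_init` are hypothetical helper names, not DeepSpeed APIs, and the estimate ignores activations, gradients, and optimizer state.

```python
BYTES_PER_FP16 = 2  # a half-precision float occupies 2 bytes


def fp16_weight_bytes(num_params: int) -> int:
    """Bytes needed to hold the model weights alone in fp16."""
    return num_params * BYTES_PER_FP16


def needs_zero_init(num_params: int, gpu_memory_bytes: int) -> bool:
    """Return True when the fp16 weights alone exceed one GPU's memory.

    Hypothetical helper: this is the situation the commit message
    describes, where the model cannot first be built on a single device.
    """
    return fp16_weight_bytes(num_params) > gpu_memory_bytes


if __name__ == "__main__":
    gib = 1024 ** 3
    # Assumed example sizes: 20e9 parameters and a 32 GiB GPU.
    print(needs_zero_init(20_000_000_000, 32 * gib))
```

A real decision would also need headroom for everything else that lives on the device, so this check is a lower bound rather than a rule.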

31 Mar, 2021 3 commits
- [website] we're hiring! · c814abda
  Jeff Rasley authored Mar 30, 2021
  
  c814abda
- [website] We're hiring! + integration posts · c6b497df
  Jeff Rasley authored Mar 30, 2021
  
  c6b497df
- We're hiring! + integration posts · 8c9e16eb
  Jeff Rasley authored Mar 30, 2021
  
  8c9e16eb
30 Mar, 2021 3 commits

Bump kramdown from 2.3.0 to 2.3.1 in /docs (#905) · c0422642

dependabot[bot] authored Mar 30, 2021

Bumps [kramdown](https://github.com/gettalong/kramdown) from 2.3.0 to 2.3.1.
- [Release notes](https://github.com/gettalong/kramdown/releases)
- [Changelog](https://github.com/gettalong/kramdown/blob/master/doc/news.page)
- [Commits](https://github.com/gettalong/kramdown/commits)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

c0422642

- update backward api doc (#903) · 23ff6cb7
  Jeff Rasley authored Mar 30, 2021
  
  23ff6cb7
- update kramdown (#901) · af2d8fc5
  Jeff Rasley authored Mar 30, 2021
```
security alert related to older kramdown version
```
  af2d8fc5

27 Mar, 2021 2 commits

Fix zero stage2 cpu_offload when some model trainable parameters skipped in training (#861) · 7fcc8911

hamlet authored Mar 27, 2021

* Fix zero stage2 cpu_offload when some model trainable parameters are skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707

When some trainable model parameters are skipped in training,
their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run,
so they have no norm_for_param_grads entry

* Trim space

* Trim space
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

7fcc8911
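
The failure mode described in this commit message can be sketched as follows. This is a minimal illustration of the general pattern, not the code from the PR: `total_grad_norm` is a hypothetical function, and it assumes per-parameter norms are kept in a dict keyed by parameter id, with no entry for parameters whose backward hook never ran.

```python
import math
from typing import Dict, Iterable


def total_grad_norm(param_ids: Iterable[int],
                    norm_for_param_grads: Dict[int, float]) -> float:
    """Combine per-parameter gradient norms into one global L2 norm.

    Parameters whose backward hook never ran have no entry in
    `norm_for_param_grads`; they are skipped instead of raising KeyError.
    """
    total_sq = 0.0
    for param_id in param_ids:
        if param_id in norm_for_param_grads:
            total_sq += norm_for_param_grads[param_id] ** 2
    return math.sqrt(total_sq)


if __name__ == "__main__":
    # Parameter 1 was skipped in training, so it has no recorded norm.
    print(total_grad_norm([0, 1, 2], {0: 3.0, 2: 4.0}))
```

Indexing the dict directly for every trainable parameter would raise `KeyError` for the skipped one, which is why the membership check matters here.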

- save_fp16_model consolidated for zero3 (#893) · 39013dd2
  Stas Bekman authored Mar 26, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  39013dd2

26 Mar, 2021 1 commit
- full fp32 weights reconstruction for zero 2+3 (#892) · 7531c6bf
  Stas Bekman authored Mar 26, 2021
  
  7531c6bf
25 Mar, 2021 1 commit

[debug utils] see_memory_usage fixes (#890) · 7f03282c

Stas Bekman authored Mar 25, 2021

* see_memory_usage fixes

* didn't expect pt-1.2

* fix the order of things

* fix the order of things

7f03282c

24 Mar, 2021 1 commit

[doc] pipeline (#888) · 22d5a1f3

Stas Bekman authored Mar 23, 2021

* [doc] pipeline

As @g-karthik flagged in https://github.com/microsoft/DeepSpeed/pull/659#discussion_r600132598 my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. 

Thank you!

* tweak

22d5a1f3

18 Mar, 2021 2 commits

[doc] launcher (#868) · 9e9f8cbe

Stas Bekman authored Mar 18, 2021

As discussed in https://github.com/microsoft/DeepSpeed/issues/662 this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` cl arg in the correct place in the invocation script

Fixes: https://github.com/microsoft/DeepSpeed/issues/662

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

9e9f8cbe

consistent checkpoint filenaming (#865) · 10c0bea6

Stas Bekman authored Mar 18, 2021



* consistent checkpoint filenaming

* backward compatible rename
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

10c0bea6

16 Mar, 2021 3 commits

1-bit Adam v2 (#817) · 68c8481b

Conglong Li authored Mar 16, 2021

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

- NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
- Add support for momentum masks for parameters with constant zero gradients during training.
- Bug fixes (e.g., #813).

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm. phase 2. nccl comm. passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missing file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoint handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

68c8481b
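
The momentum-mask idea mentioned in this commit can be illustrated with a small sketch. It assumes the mask is a 0/1 vector multiplied into the first-moment estimate after each update, so that entries whose gradients are always zero keep a momentum of exactly zero. `masked_momentum_update` is a hypothetical function written in plain Python for illustration, not the DeepSpeed implementation.

```python
from typing import List


def masked_momentum_update(momentum: List[float],
                           grad: List[float],
                           mask: List[float],
                           beta1: float = 0.9) -> List[float]:
    """One Adam-style first-moment update followed by a 0/1 mask.

    A mask value of 0.0 forces the corresponding momentum entry to zero;
    a mask value of 1.0 leaves the updated momentum unchanged.
    """
    updated = [beta1 * m + (1.0 - beta1) * g for m, g in zip(momentum, grad)]
    return [u * k for u, k in zip(updated, mask)]


if __name__ == "__main__":
    # The second entry is masked out, so its momentum is reset to zero.
    print(masked_momentum_update([1.0, 1.0], [0.0, 0.0], [1.0, 0.0], beta1=0.5))
```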

- bump version 0.3.13 · 12a53b43
  Jeff Rasley authored Mar 16, 2021
  
  12a53b43
- Make config objects json serializable (#862) · 7bcd72a2
  Olatunji Ruwase authored Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  7bcd72a2