- 29 May, 2023 1 commit
  - aiss authored
- 30 Mar, 2023 1 commit
  - aiss authored
- 25 May, 2022 1 commit
  - aiss authored
- 02 Apr, 2021 1 commit
  - Jeff Rasley authored
    This test has been giving us trouble for a while, with nondeterministic failures; skipping it for now so it does not break our CI. We need to revisit it soon, though.
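
For context, skipping a flaky test in a pytest-based suite like this one typically looks like the sketch below; the test name and reason string are illustrative, not the actual change.

```python
import pytest

@pytest.mark.skip(reason="Nondeterministic failures; skipping to unblock CI. Revisit soon.")
def test_flaky_feature():
    ...
```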
- 16 Mar, 2021 3 commits
  - Conglong Li authored
    Authors: @awan-10 @conglongli @samyam @jeffra
    What's new:
    * NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
    * Support for momentum masks for parameters with constant zero gradients during training.
    * Bug fixes (e.g., #813).
    (A hedged config sketch for enabling this follows after this date's commits.)
    Squashed history:
    * NCCL-based 1-bit Adam + code refactor for comm. backends (#594)
    * NCCL-based 1-bit implementation + refactor to add communication backends (#593)
    * Add nccl 1-bit optim.
    * Temporary commit to save stuff.
    * Use dist collectives instead of mpi routines.
    * Remove old code for comm.
    * Fix bugs. Still does not work.
    * Modify to test the nccl-side code path.
    * Initial gather impl. Works intra-node.
    * Updates to comm. phase 2. nccl comm. passed the tests.
    * Refactor code to introduce nccl/mpi as backends for onebit adam.
    * Refactor updates to test/engine.
    * Fix compile/runtime errors.
    * Simplify support for nccl/mpi backends.
    * Add missing file.
    * Add compression backend in constructor. Revert later.
    * Modify test with some perf counting.
    * Implement a true non-blocking gather for the nccl side.
    * Revert "Add compression backend in constructor. Revert later." This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.
    * Improve the 1-bit adam test.
    * Refactor comm. and compression backend in 1-bit adam.
    * Fix the test.
    * Fix runtime errors and typos in nccl backend.
    * Fix mpi backend. Modify tests.
    * Modify nccl perf test.
    * Fix mpi-side errors.
    * Add an mpi perf test.
    * Sync DSE.
    * Remove old collectives file.
    * Undo a typo.
    * Graceful failure for torch versions that don't support nccl pt2pt.
    * Revert "Merge branch 'master' into staging-1bit-nccl-v2". This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.
    * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2"". This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.
    * Comm optimization + 1-bit lamb.
    * Saving/debugging commit.
    * Finalizing 1-bit lamb.
    * Finalizing 1-bit lamb.
    * Add momentum mask and chkpt handling for 1-bit adam.
    * Cleanup; modify nccl test to be runnable with the deepspeed launcher.
    * Fix format.
    * Fix formatting again.
    * Make test runnable without mpi4py.
    * Add dist.alltoall and dist.allgather instead of custom functions.
    * Remove debug prints.
    * Formatting and renaming.
    * Renaming.
    * Renaming.
    * Add unit test; fix existing tests.
    * Skip unit test when torch < 1.8.
    * Revert 1-bit lamb.
    * Flatten momentum when dimension is more than 1.
    * Add warning message for 1-bit adam under fp32.
    * Improve version check.
    * Add fp32 test.
    * 1-bit adam doc.
    * Fix file name.
    * Doc fix.
    * torch 1.8 is released.
    * Doc fix.
    * Fix tests.
    * Update news.
    * Add doc for momentum mask.
    * Fix checkpoint handling; add unit test.
    * Checkpoint handling doc.
    * Doc final cleanup.
    * Bump dates.
    * Update tests.
    * URL change.
    * Doc fix.
    * Fix test.
    * Doc update.
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Olatunji Ruwase authored
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Jeff Rasley authored
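
For the 1-bit Adam commit above, here is a minimal sketch of how the new NCCL backend would be selected in a DeepSpeed config. The key names (`OneBitAdam`, `freeze_step`, `comm_backend_name`) are assumptions based on the DeepSpeed documentation, not taken from this commit.

```python
# Hedged sketch: DeepSpeed config enabling 1-bit Adam with the NCCL backend.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},  # 1-bit Adam is intended for fp16 training
    "optimizer": {
        "type": "OneBitAdam",   # assumed optimizer type name
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,          # uncompressed warmup steps before compression starts
            "comm_backend_name": "nccl",  # the NCCL backend added by this work (vs. "mpi")
        },
    },
}
```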
- 15 Mar, 2021 1 commit
  - Samyam Rajbhandari authored
    * Fix mis-aligned grad: when a parameter's size is not divisible by the world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR fixes that.
    * Formatting fix
    * Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
    * Also removing alignment from flat fp16 buffers
    * Testing for hidden dim alignment
    * Inference hook fix
    * Update stage3.py
    * Formatting
    * [bug-fix] Move params to GPU if offload params is turned off
    Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
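
The mis-alignment described above comes from partitioning a flattened tensor across ranks; below is a minimal sketch of the padding arithmetic involved. This is an illustration of the general rule, not the repository's code.

```python
def padded_numel(numel: int, world_size: int) -> int:
    """Round numel up to a multiple of world_size so each rank gets an equal shard."""
    remainder = numel % world_size
    return numel if remainder == 0 else numel + (world_size - remainder)

# 10 elements across 4 ranks: pad to 12 so every rank holds exactly 3.
assert padded_numel(10, 4) == 12
```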
- 11 Mar, 2021 1 commit
  - Jeff Rasley authored
- 08 Mar, 2021 1 commit
  - Samyam Rajbhandari authored
    * Squash stage3 v1 (#146)
      Co-authored-by: Samyam <samyamr@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
      Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
      Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
      Co-authored-by: eltonzheng <eltonz@microsoft.com>
    * Fix correctness bug (#147)
    * Formatting fix (#150)
    * Stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
    * fp16 Z3 API update and bugfix
    * Revert debug change
    * ZeRO-3 detach and race condition bugfixes (#149)
    * Trying out ZeRO-3 race condition fix
    * CUDA sync instead of stream
    * Reduction stream sync
    * Remove commented code
    * Fix optimizer state_dict KeyError (#148)
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    * Fix for smaller SGS sizes; ensures each grad is backed by unique tensors (#152)
    * Simplifying the logic for getting averaged gradients (#153)
    * Skip for now
    * Z3 Docs redux (#154)
    * Removing some TODOs and commented code (#155)
    * New Z3 defaults (#156)
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    * Formatting
    * Megatron external params
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
    Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
    Co-authored-by: eltonzheng <eltonz@microsoft.com>
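
Enabling the ZeRO Stage 3 support introduced above is a configuration change; here is a minimal sketch, assuming the standard `zero_optimization` config section. The key names follow the DeepSpeed docs, not this commit.

```python
# Hedged sketch: minimal DeepSpeed config fragment turning on ZeRO Stage 3.
# Only "stage" is set here; everything else is left to the new defaults
# mentioned in the squashed history ("New Z3 defaults").
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
    },
}
```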
- 12 Feb, 2021 1 commit
  - Olatunji Ruwase authored
    * Activation checkpoint support for non-tensor input/output
    * Format fixes
    * Address PR comments; add ordering edge case tests
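
A sketch of what the non-tensor input/output support above enables, assuming DeepSpeed's activation checkpointing API mirrors `torch.utils.checkpoint`; the call shape is an assumption from the feature description, not from this commit.

```python
import torch
import deepspeed

def block(x, mask, verbose):
    # "mask" may be None and "verbose" is a plain bool: with this change,
    # non-tensor inputs and outputs can pass through a checkpointed function.
    y = torch.relu(x)
    if mask is not None:
        y = y * mask
    return y, verbose

x = torch.randn(4, 8, requires_grad=True)
# Assumed API shape, mirroring torch.utils.checkpoint.checkpoint:
out, flag = deepspeed.checkpointing.checkpoint(block, x, None, True)
```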
- 11 Feb, 2021 1 commit
  - Cheng Li authored
    * Work on flops profiler tutorial
    * Update flops profiler tutorial
    * Add flops profiler tutorial and fix names
    * Work on flops profiler tutorial
    * Update flops profiler tutorial
    * Add flops profiler tutorial and fix names
    * Fix trailing whitespace
    * Fix names
    * Remove multistep profiling and update docs
    * Fix cases where functionals and submodules coexist in a parent module; update readme
    * Fix typo
    * Always invoke post hook function
    * Fix module flops sum and update tests
    * Update tutorial
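
The tutorial referenced above covers the flops profiler; below is a minimal sketch of its standalone use, assuming the `get_model_profile` entry point described in the DeepSpeed profiling docs. The import path, argument names, and return values are assumptions, not taken from this commit.

```python
import torch
from deepspeed.profiling.flops_profiler import get_model_profile  # assumed import path

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

# Profile one forward pass over a batch of a single 512-d input.
flops, macs, params = get_model_profile(
    model=model,
    input_shape=(1, 512),
    print_profile=True,  # print a per-module breakdown
)
```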
- 29 Jan, 2021 1 commit
  - Jeff Rasley authored
- 27 Jan, 2021 1 commit
  - Jeff Rasley authored
- 20 Jan, 2021 1 commit
  - Shaden Smith authored
- 15 Jan, 2021 1 commit
  - Olatunji Ruwase authored
- 14 Jan, 2021 1 commit
  - Jeff Rasley authored
- 13 Jan, 2021 1 commit
  - Cheng Li authored
    Co-authored-by: Cheng Li <pistasable@gmail.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 12 Jan, 2021 1 commit
  - Shaden Smith authored
    Special thanks to @g-karthik for tracking this issue down.
- 08 Jan, 2021 2 commits
  - Olatunji Ruwase authored
    * Add linear warmup + decay LR schedule; update LR schedule unit tests
    * LR scheduler unit tests for LR Range Test and 1Cycle
    * Disable yapf to preserve parameterization
    * Disable test_pipe.py for CI debugging
    * Disable test_lr_scheduler for CI debugging
    * Disable test_lr_scheduler for CI debugging
    * Enable all unit tests for CI debugging
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    (A sketch of the warmup + decay schedule shape follows after this date's commits.)
  - Jeff Rasley authored
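
For the LR schedule work above: the linear warmup + decay shape can be written as a small pure function. This is a sketch of the general schedule, not the class the commit adds.

```python
def warmup_decay_lr(step, warmup_steps, total_steps, max_lr, min_lr=0.0):
    """Linearly ramp from min_lr to max_lr over warmup_steps, then decay back toward zero."""
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / max(total_steps - warmup_steps, 1)

assert warmup_decay_lr(0, 100, 1000, 1e-3) == 0.0
assert warmup_decay_lr(100, 100, 1000, 1e-3) == 1e-3   # peak at end of warmup
assert warmup_decay_lr(1000, 100, 1000, 1e-3) == 0.0   # fully decayed
```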
- 06 Jan, 2021 1 commit
  - Jeff Rasley authored
    Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 05 Jan, 2021 1 commit
  - gcooper-isi authored
    Allow DeepSpeed models to be initialized with optimizer=None
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
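
A sketch of the initialization path the change above allows. The exact `deepspeed.initialize` keyword set here is assumed from the public API, and `ds_config` and `model` are placeholders.

```python
import torch
import deepspeed

model = torch.nn.Linear(16, 16)
ds_config = {"train_batch_size": 8}

# With this change, a DeepSpeed engine can be built with no optimizer at all
# (e.g., for inference or evaluation-only runs).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=None,
    config=ds_config,  # the "config" kwarg is an assumption; versions of this era used config files
)
```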
- 04 Jan, 2021 1 commit
  - Olatunji Ruwase authored
- 23 Dec, 2020 1 commit
  - Jeff Rasley authored
    Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
- 18 Dec, 2020 1 commit
  - Jeff Rasley authored
- 17 Dec, 2020 1 commit
  - Reza Yazdani authored
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 02 Dec, 2020 1 commit
  - Jeff Rasley authored
- 01 Dec, 2020 1 commit
  - Reza Yazdani authored
    * Supporting different hidden dimensions
    * Add support for larger hidden dimensions (greater than 8K)
    * Remove empty line
    * Add loop unrolling factor for dropout kernels
    * Update different kernels based on the reviews
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 25 Nov, 2020 1 commit
  - Jeff Rasley authored
- 21 Nov, 2020 1 commit
  - Olatunji Ruwase authored
- 20 Nov, 2020 1 commit
  - Olatunji Ruwase authored
    * Use zero-tensors for missing gradients to avoid size mismatch
    * Unit test for unbalanced gradients in ZeRO
    * Formatting fixes
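
The first fix above swaps missing gradients for zero tensors; below is a minimal sketch of the idea. This is an illustration, not the repository's code.

```python
import torch

def gradients_for_reduction(params):
    # Parameters that received no gradient contribute zeros, so every rank
    # presents identically-sized tensors to the collective reduction.
    return [p.grad if p.grad is not None else torch.zeros_like(p) for p in params]
```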
- 19 Nov, 2020 1 commit
  - Jeff Rasley authored
    * ZeRO-1 memory fix
    * Auto-tune max elems per comm to reduce padding/comm intervals
    * Clean up and add previously missing reduction options
    * Fix testing backend to work with torch 1.7
- 18 Nov, 2020 1 commit
  - Olatunji Ruwase authored
    * Fix layout bug in ZeRO Stage 1 checkpoint logic; add elastic checkpoint option for ZeRO Stage 1, defaulting to True
    * Format fixes
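
A hypothetical config fragment for the elastic checkpoint option above; the `elastic_checkpoint` key name is inferred from the commit description and should be treated as an assumption.

```python
# Hedged sketch: ZeRO Stage 1 with the elastic checkpoint behavior enabled
# (the commit says it defaults to True; the key name is an assumption).
ds_config = {
    "zero_optimization": {
        "stage": 1,
        "elastic_checkpoint": True,
    },
}
```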
- 12 Nov, 2020 1 commit
  - Jeff Rasley authored
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
    Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
- 10 Nov, 2020 1 commit
  - Olatunji Ruwase authored
    * Progressive layer dropping docs (#499)
    * Test
    * Adding tutorial and news page for PLD
    * Updating the tutorial and posts of PLD
    * Update the finetune tutorial
    * Update PLD tutorial (#512)
    * Update installation instructions
    * Format fix
    * ZeRO tutorial
    * Format fixes
    * ZeRO-Offload
    * ZeRO and ZeRO-Offload tutorials
    * Update navigation page
    * Format fixes
    * Add yuxhe feedback
    * Fix blog post link
    * Fix OneBit-Adam link; tweak scheduler example
    * Fix date link
    * Add DeepSpeed_Adam
    * Add PLD tutorial to navigation
      Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    * Updating the PLD docs
    * DeepSpeed implementation of PLD (#508)
    * DeepSpeed implementation of PLD
    * Format fixes
    * Formatting fixes
    * Fix broken url
    * Address PR feedback
    * Bump DSE
    Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
- 30 Oct, 2020 1 commit
  - Reza Yazdani authored
    * Add AdamW to the CPU-Adam implementation
    * Support the cpu-adam optimizer for ZeRO-Offload on the DeepSpeed side
    * Bump DSE to match cpu-adam updates
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
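
For the AdamW support above, a sketch of selecting decoupled weight decay in the CPU Adam optimizer; the `adamw_mode` flag and import path reflect the `DeepSpeedCPUAdam` API as I recall it and should be treated as assumptions.

```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam  # assumed import path

model = torch.nn.Linear(16, 16)

# adamw_mode=True selects AdamW-style (decoupled) weight decay rather than
# classic Adam L2 regularization; the flag name is an assumption.
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-4, weight_decay=0.01, adamw_mode=True)
```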
- 07 Oct, 2020 2 commits
  - Jeff Rasley authored
  - Jeff Rasley authored
- 29 Sep, 2020 1 commit
  - Olatunji Ruwase authored
    * Disable default installation of CPU Adam
    * Handle cpufeature import/use errors separately
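
A sketch of the separated error handling described above, assuming the optional `cpufeature` package; the structure illustrates "import errors handled separately from use errors" and is not the repository's code.

```python
# Import failure (package absent) and use failure (probe raising at runtime)
# are caught separately so one failure mode does not mask the other.
try:
    import cpufeature
except ImportError:
    cpufeature = None

avx2_supported = False
if cpufeature is not None:
    try:
        avx2_supported = bool(cpufeature.CPUFeature.get("AVX2", False))
    except Exception:
        avx2_supported = False  # probing failed; fall back conservatively
```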