Commits · fa87a73a8a3bead24ad9ea52090646fa620d74e8 · OpenDAS / deepspeed

16 Mar, 2021 2 commits
- Fix ZeRO3 save_checkpoint (#857) · fa87a73a
  Olatunji Ruwase authored Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Allow args to be optional in deepspeed.initialize (#825) · 871f3048
  Jeff Rasley authored Mar 16, 2021
  
15 Mar, 2021 1 commit

Samyamr/inference hook fix (#851) · 46018859

Samyam Rajbhandari authored Mar 15, 2021



* Fix mis-aligned-grad

When a parameter's size is not divisible by the world size, the partitioned gradients are misaligned because padding is handled incorrectly. This PR fixes that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
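
The misalignment described in this commit message can be illustrated with a small, generic sketch. It is not the code from `stage3.py`; the function name `partition_with_padding` and the plain-list representation are assumptions made for illustration. The idea shown is only that padding is appended once at the tail of the flat buffer, so every rank gets a shard of the same length and the shard boundaries stay aligned.

```python
def partition_with_padding(flat, world_size, pad_value=0.0):
    """Split a flat list into `world_size` equal shards.

    Hypothetical helper: padding is added only at the end of the buffer,
    so shard r always covers elements [r * shard_size, (r + 1) * shard_size).
    """
    remainder = len(flat) % world_size
    # Number of filler elements needed to make the length divisible.
    padding = (world_size - remainder) % world_size
    padded = list(flat) + [pad_value] * padding
    shard_size = len(padded) // world_size
    shards = [
        padded[rank * shard_size:(rank + 1) * shard_size]
        for rank in range(world_size)
    ]
    return shards, padding


# Five elements across four ranks: three filler elements are appended.
shards, padding = partition_with_padding([1.0, 2.0, 3.0, 4.0, 5.0], world_size=4)
```

With these inputs, each rank receives a shard of length two, and only the last two shards contain filler values.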


11 Mar, 2021 1 commit
- set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (#844) · dd03cff2
  Jeff Rasley authored Mar 10, 2021
  
08 Mar, 2021 1 commit

ZeRO 3 Offload (#834) · 599258f9

Samyam Rajbhandari authored Mar 08, 2021



* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>


12 Feb, 2021 1 commit

Activation checkpointing for non-tensor arguments and return values (#741) · ec8b1cb0

Olatunji Ruwase authored Feb 12, 2021

* Activation checkpoint support for non tensor input/output

* Format fixes

* Address PR comments; Add ordering edge case tests


11 Feb, 2021 1 commit

Add flops profiler tutorial (#682) · e2dfe0d1

Cheng Li authored Feb 10, 2021

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* fix trailing ws

* fix names

* remove multistep profiling and update docs

* fix cases where functionals and submodules coexist in a parent module, update readme

* fix typo

* always invoke post hook function

* fix module flops sum and update tests

* update tutorial


29 Jan, 2021 1 commit
- Dist testing backend fixes, etc. (#708) · 2e2dd861
  Jeff Rasley authored Jan 29, 2021
  
27 Jan, 2021 1 commit
- [transformer-kernel] turn off unit test printing (#701) · 91b1b7f3
  Jeff Rasley authored Jan 27, 2021
  
20 Jan, 2021 1 commit
- make test_pipe more stable (#683) · e59ba12d
  Shaden Smith authored Jan 20, 2021
  
15 Jan, 2021 1 commit
- Support optimizer AdamW type (#670) · 865104be
  Olatunji Ruwase authored Jan 15, 2021
  
14 Jan, 2021 1 commit
- Validate consistent ckpt tags across ranks (#667) · f032e56f
  Jeff Rasley authored Jan 14, 2021
  
13 Jan, 2021 1 commit

squash latest flops profiling changes (#1) (#664) · e2fbe4d2

Cheng Li authored Jan 12, 2021


Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>


12 Jan, 2021 1 commit
- Handle activation checkpointing args that are None or non-tensors (#660) · adcfd269
  Shaden Smith authored Jan 12, 2021
```
Special thanks to @g-karthik for tracking this issue down.
```
08 Jan, 2021 2 commits

LR scheduler unit tests (#429) · da5563a9

Olatunji Ruwase authored Jan 08, 2021



* Add Linear warmup+decay lr schedule
Update lr schedule unit tests

* LR scheduler unit tests for LR Range Test and 1Cycle

* Disable yapf to preserve parameterization

* Disable test_pipe.py for CI debugging

* Disable test_lr_scheduler for CI debugging

* Disable test_lr_scheduler for CI debugging

* Enable all unit tests for CI debugging
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>


add additional validation checks in elastic config (#646) · bc046dc4
Jeff Rasley authored Jan 08, 2021


06 Jan, 2021 1 commit

Module replacement support (#586) · 44bd538b

Jeff Rasley authored Jan 06, 2021


Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>


05 Jan, 2021 1 commit

Allow DeepSpeed models to be initialized with optimizer=None (#469) · a9a83a6f

gcooper-isi authored Jan 05, 2021



Allow DeepSpeed models to be initialized with optimizer=None
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>


04 Jan, 2021 1 commit
- Support initialization with dict configuration (#632) · e6ac7311
  Olatunji Ruwase authored Jan 04, 2021
  
23 Dec, 2020 1 commit
- Elastic training support (#602) · 81aeea36
  Jeff Rasley authored Dec 22, 2020
```
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
```
18 Dec, 2020 1 commit
- Ability to initialize distributed backend outside deepspeed runtime (#608) · 7435b2f1
  Jeff Rasley authored Dec 17, 2020
  
17 Dec, 2020 1 commit
- Transformer-kernel - supporting any arbitrary sequence-length (#587) · fd2f970b
  Reza Yazdani authored Dec 17, 2020
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
02 Dec, 2020 1 commit
- Add 'latest' checkpoint save/load support (#569) · 845921b3
  Jeff Rasley authored Dec 02, 2020
  
01 Dec, 2020 1 commit

supporting different hidden dimensions (#559) · c78c29f9

Reza Yazdani authored Dec 01, 2020



* supporting different hidden dimensions

* add support for larger hidden dimensions (greater than 8K)

* remove empty line

* add loop unrolling factor for dropout kernels

* update different kernels based on the reviews
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>


25 Nov, 2020 1 commit
- Turn back on PP tests (#558) · eec44af1
  Jeff Rasley authored Nov 24, 2020
  
21 Nov, 2020 1 commit
- Support non-tensor state in checkpoint (#548) · 6021b702
  Olatunji Ruwase authored Nov 21, 2020
  
20 Nov, 2020 1 commit

Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (#545) · 0178e6cc

Olatunji Ruwase authored Nov 20, 2020

* Use zero-tensors for missing gradients to avoid size mismatch

* Unit test for unbalanced gradients in ZeRO

* Formatting fixes
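
The first bullet above, substituting zero tensors for missing gradients, can be sketched in plain Python. This is an illustrative assumption, not DeepSpeed's implementation: `flatten_grads` is a hypothetical helper, and lists stand in for tensors. The point is that a parameter with no gradient still contributes a zero-filled slot of its full size, so the flattened buffer has the same length on every rank.

```python
def flatten_grads(param_sizes, grads):
    """Concatenate per-parameter gradients into one flat list.

    Hypothetical helper: a gradient of None (the parameter was unused in
    this step) is replaced by zeros of the parameter's size, so the flat
    buffer length never depends on which gradients happen to exist.
    """
    flat = []
    for size, grad in zip(param_sizes, grads):
        if grad is None:
            flat.extend([0.0] * size)  # zero-filled placeholder
        else:
            flat.extend(grad)
    return flat


# The middle parameter (size 3) has no gradient on this rank.
flat = flatten_grads([2, 3, 1], [[0.5, -0.5], None, [1.0]])
```

Without the placeholder, ranks that skipped different parameters would produce buffers of different lengths, which is the size mismatch the bullet refers to.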


19 Nov, 2020 1 commit

ZeRO-1 tune max-elems + bug fix (#532) · 08c96a1b

Jeff Rasley authored Nov 19, 2020

* zero-1 memory fix

* auto-tune max elems per comm to reduce padding/comm intervals

* clean-up and added previously missing reduction options

* fix testing backing to work with torch1.7


18 Nov, 2020 1 commit

Fix layout bug in ZeRO Stage 1 checkpoint logic (#531) · 7752dc5e

Olatunji Ruwase authored Nov 17, 2020

* Fix layout bug in ZeRO Stage 1 checkpoint logic
Add elastic checkpoint option for ZeRO stage 1, default to True

* Format fixes


12 Nov, 2020 1 commit

DeepSpeed JIT op + PyPI support (#496) · 31f46fee

Jeff Rasley authored Nov 12, 2020


Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>


10 Nov, 2020 1 commit

PLD release (#513) · be1147c0

Olatunji Ruwase authored Nov 10, 2020



* Progressive layer dropping docs (#499)

* test

* Adding tutorial and news page for pld

* updating the tutorial and posts of PLD

* update the finetune tutorial

* Update PLD tutorial (#512)

* Update installation instructions

* Format fix

* ZeRO tutorial

* Format fixes

* ZeRO-Offload

* ZeRO and ZeRO-Offload tutorials

* Update navigation page

* Format fixes

* Add yuxhe feedback

* Fix blog post link

* Fix OneBit-Adam link
Tweak scheduler example

* Fix date link

* Add DeepSpeed_Adam

* Add PLD tutorial to navigation
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* updating the pld docs

* DeepSpeed implementation of PLD (#508)

* DeepSpeed implementation of PLD

* Format fixes

* Formatting fixes

* Fix broken url

* Address PR feedback

* Bump DSE
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>


30 Oct, 2020 1 commit

Add CPUAdam optimizer for zero-offload in deepspeed engine (#484) · f5aa2547

Reza Yazdani authored Oct 30, 2020



* add adamW to CPU-ADAM implementation

* supporting cpu-adam optimizer for zero-offload on deepspeed side

* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>


07 Oct, 2020 2 commits
- turning off different tests (temp) · 679fc135
  Jeff Rasley authored Oct 06, 2020
  
- temporarily disable lr unit tests · 11cf47ef
  Jeff Rasley authored Oct 06, 2020
  
29 Sep, 2020 1 commit
- Disable default installation of CPU Adam (#450) · 7b8be2a7
  Olatunji Ruwase authored Sep 29, 2020
```
* Disable default installation of CPU Adam

* Handle cpufeature import/use errors separately
```
25 Sep, 2020 1 commit
- unit test rename (#442) · 5412a334
  Shaden Smith authored Sep 25, 2020
  
22 Sep, 2020 1 commit
- support dynamic sequence length in transformer kernels (#424) · f0f2a702
  RezaYazdaniAminabadi authored Sep 21, 2020
```
Co-authored-by: Conglong Li <conglong.li@gmail.com>
```
21 Sep, 2020 1 commit
- Add configurable intermediate size to transformer kernels (#423) · a148bd33
  RezaYazdaniAminabadi authored Sep 21, 2020
  
18 Sep, 2020 2 commits
- Fix activation checkpoint unit tests for GPU systems (#421) · a825f996
  Shaden Smith authored Sep 18, 2020
  
- Revert "Activation checkpointing bugfix and unit tests (#420)" (#422) · a74a604a
  Jeff Rasley authored Sep 18, 2020
```
This reverts commit 01b6e27e.
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
```