Commits · 9f52a36fad37bcb26e19dfc767d4405310f52399 · OpenDAS / deepspeed

01 Dec, 2020 2 commits

tracking optimizer step in cpu-adam when loading checkpoint (#564) · 9f52a36f

Reza Yazdani authored Dec 01, 2020

* tracking optimizer step in cpu-adam when loading checkpoint

* add warning/error message for updating optimizer step count

* resolve build issue

* supporting state update from the python side

* track step from python in all cases

* remove comma

9f52a36f

supporting different hidden dimensions (#559) · c78c29f9

Reza Yazdani authored Dec 01, 2020



* supporting different hidden dimensions

* add support for larger hidden dimensions (greater than 8K)

* remove empty line

* add loop unrolling factor for dropout kernels

* update different kernels based on the reviews
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

c78c29f9

28 Nov, 2020 1 commit

[doc] typo fix and clarification (#563) · 17f36f1b

Stas Bekman authored Nov 27, 2020

This PR:
* fixes a misspelled method name
* also, `( () )` doesn't read well until one reads the code and understands that it's not a formatting bug; I propose simply saying that it's a callable object.

17f36f1b

25 Nov, 2020 4 commits
- bump to 0.3.7 · c51fa65d
  Jeff Rasley authored Nov 25, 2020
  
  c51fa65d
- update manifest · e4e20662
  Jeff Rasley authored Nov 25, 2020
  
  e4e20662
- bump to 0.3.6 and fix manifest to include reqs (#561) · 73c3262d
  Jeff Rasley authored Nov 25, 2020
  
  73c3262d
- Adds long_description to setup.py (#560) · 60097136
  Shaden Smith authored Nov 25, 2020
  
  60097136
23 Nov, 2020 1 commit
- bump to 0.3.5 · 16313a96
  Jeff Rasley authored Nov 23, 2020
  
  16313a96
25 Nov, 2020 2 commits
- Turn back on PP tests (#558) · eec44af1
  Jeff Rasley authored Nov 24, 2020
  
  eec44af1
- Simplify dist init and only init if needed. (#553) · 0e831e23
  Ammar Ahmad Awan authored Nov 24, 2020
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  0e831e23
24 Nov, 2020 4 commits
- Deprecate client ability to disable gradient reduction (#552) · 6e65c2cc
  Olatunji Ruwase authored Nov 24, 2020
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  6e65c2cc
- Update badges and CI name (#557) · 1ef5cd23
  Jeff Rasley authored Nov 24, 2020
  
  1ef5cd23
- Switch CI to GitHub Actions (#556) · 3347460e
  Jeff Rasley authored Nov 24, 2020
  
  3347460e
- Create main.yml · c18fb0de
  Jeff Rasley authored Nov 24, 2020
  
  c18fb0de
23 Nov, 2020 2 commits
- Bug fix for norm calculation in absence of model parallel group (#551) · 00c3a254
  Samyam Rajbhandari authored Nov 23, 2020
```
In the absence of a model parallel group, model_parallel_allreduce should not do any reduction. This commit fixes a bug that performed a model parallel allreduce across the world group when the model parallel group is None.
```
  00c3a254
- Adding static_loss_scale to unfused optimizer (#546) · bcd56f97
  Samyam Rajbhandari authored Nov 22, 2020
  
  bcd56f97
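
The fix described in #551 above comes down to skipping the reduction when no model parallel group exists. A minimal sketch of that guard follows, using plain Python values instead of a real communication backend. The function name, the `group_values` parameter, and the `reduce_op` default are all hypothetical and are not taken from the DeepSpeed source.

```python
from typing import Callable, Optional, Sequence


def model_parallel_allreduce_sketch(
    local_value: float,
    group_values: Optional[Sequence[float]],
    reduce_op: Callable[[Sequence[float]], float] = sum,
) -> float:
    """Illustrative stand-in for a model parallel allreduce.

    `group_values` (hypothetical) holds the value contributed by every rank
    in the model parallel group, including this one, or None when no model
    parallel group exists.
    """
    if group_values is None:
        # No model parallel group: do not reduce at all, and in particular
        # do not fall back to reducing across the whole world group.
        return local_value
    # A model parallel group exists: combine the contributions of its ranks.
    return reduce_op(group_values)
```

With this guard, a partial norm computed on a rank without model parallelism is returned unchanged rather than being combined with values from unrelated ranks.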
21 Nov, 2020 1 commit
- Support non-tensor state in checkpoint (#548) · 6021b702
  Olatunji Ruwase authored Nov 21, 2020
  
  6021b702
20 Nov, 2020 1 commit

Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (#545) · 0178e6cc

Olatunji Ruwase authored Nov 20, 2020

* Use zero-tensors for missing gradients to avoid size mismatch

* Unit test for unbalanced gradients in ZeRO

* Formatting fixes

0178e6cc
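
The first bullet of #545 can be illustrated with a small sketch. It assumes that a parameter which received no gradient shows up as `None`, and that flattening such a list without padding would produce a buffer shorter than expected. The helper name and the use of plain lists in place of tensors are assumptions for illustration only, not the DeepSpeed implementation.

```python
from typing import List, Optional


def flatten_gradients(
    grads: List[Optional[List[float]]],
    sizes: List[int],
) -> List[float]:
    """Flatten per-parameter gradients into one buffer (hypothetical helper).

    `sizes[i]` is the number of elements of parameter i. A missing gradient
    (None) is replaced by zeros of the matching size so the flat buffer
    always has sum(sizes) elements.
    """
    flat: List[float] = []
    for grad, size in zip(grads, sizes):
        if grad is None:
            # Missing gradient: substitute zeros to keep the layout stable.
            flat.extend([0.0] * size)
        else:
            flat.extend(grad)
    return flat
```

Because every parameter contributes exactly `sizes[i]` elements, ranks whose parameters received gradients and ranks whose parameters did not end up with buffers of identical length.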

19 Nov, 2020 7 commits
- bump version 0.3.4 · 6b28bc5d
  Jeff Rasley authored Nov 19, 2020
  
  6b28bc5d
- Discover variables for NCCL backend on AML without mpi4py (#542) · 1b45917c
  Ammar Ahmad Awan authored Nov 19, 2020
```
* Use AML method to set env vars instead of using mpi4py.
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  1b45917c
- Fix setup.py for cpu-only environment installation (#538) · d81cb26d
  Seunghwan Hong authored Nov 20, 2020
```
* Add guard against using `torch.version.cuda` in a no-CUDA environment.
* Fix several typos in setup.py.
Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  d81cb26d
- backwards compatibility w. v020 ckpts, fix issue with zero-1 ckpts (#543) · dce054db
  Jeff Rasley authored Nov 19, 2020
  
  dce054db
- bump to v0.3.3 · 9de21b72
  Jeff Rasley authored Nov 19, 2020
  
  9de21b72
- ZeRO-1 tune max-elems + bug fix (#532) · 08c96a1b
  Jeff Rasley authored Nov 19, 2020
```
* zero-1 memory fix

* auto-tune max elems per comm to reduce padding/comm intervals

* clean-up and added previously missing reduction options

* fix testing backing to work with torch1.7
```
  08c96a1b
- more fine-grained manifest file for includes/excludes (#540) · fdd81c30
  Jeff Rasley authored Nov 18, 2020
  
  fdd81c30
18 Nov, 2020 2 commits
- append job-name if explicit output dir is given (#539) · 5b09be60
  Jeff Rasley authored Nov 18, 2020
  
  5b09be60
- Fix layout bug in ZeRO Stage 1 checkpoint logic (#531) · 7752dc5e
  Olatunji Ruwase authored Nov 17, 2020
```
* Fix layout bug in ZeRO Stage 1 checkpoint logic
Add elastic checkpoint option for ZeRO stage 1, default to True

* Format fixes
```
  7752dc5e
14 Nov, 2020 2 commits
- bump version · 9941ce75
  Jeff Rasley authored Nov 14, 2020
  
  9941ce75
- Dependency pruning (#528) · 0dc84200
  Jeff Rasley authored Nov 13, 2020
```
* remove cpu-feature

* remove psutils requirement
```
  0dc84200
12 Nov, 2020 3 commits

Installation documentation updates. (#525) · d779bd53
Shaden Smith authored Nov 12, 2020
```
* Adds torch install requirement to documentation.

* build ops documentation
```
d779bd53

ds_report bug fix on cpu and guard torch import in setup.py (#524) · ca9ab120

Jeff Rasley authored Nov 12, 2020

* on cpu box error gracefully if cuda home doesn't exist

* guard against torch import issue

* fix syntax error

* fix import

ca9ab120

DeepSpeed JIT op + PyPI support (#496) · 31f46fee

Jeff Rasley authored Nov 12, 2020


Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

31f46fee

11 Nov, 2020 2 commits

Update zero.md tutorial (#495) · 0ad4fd88

Samyam Rajbhandari authored Nov 11, 2020



* Update zero.md

Update to ZeRO tutorial to specify the use of activation checkpointing

* Update zero-offload.md

Use activation checkpointing with ZeRO-Offload
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

0ad4fd88

fix bug on non-DLTS infra when no output path set (#523) · eea1c285
Jeff Rasley authored Nov 11, 2020

eea1c285

10 Nov, 2020 2 commits

PLD release (#513) · be1147c0

Olatunji Ruwase authored Nov 10, 2020



* Progressive layer dropping docs (#499)

* test

* Adding tutorial and news page for pld

* updating the tutorial and posts of PLD

* update the finetune tutorial

* Update PLD tutorial (#512)

* Update installation instructions

* Format fix

* ZeRO tutorial

* Format fixes

* ZeRO-Offload

* ZeRO and ZeRO-Offload tutorials

* Update navigation page

* Format fixes

* Add yuxhe feedback

* Fix blog post link

* Fix OneBit-Adam link
Tweak scheduler example

* Fix date link

* Add DeepSpeed_Adam

* Add PLD tutorial to navigation
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* updating the pld docs

* DeepSpeed implementation of PLD (#508)

* DeepSpeed implementation of PLD

* Format fixes

* Formatting fixes

* Fix broken url

* Address PR feedback

* Bump DSE
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>

be1147c0

updating pld docs (#517) · e082d475
Minjia Zhang authored Nov 10, 2020

e082d475

09 Nov, 2020 2 commits
- Fix PLD news url (#515) · 41fb24b3
  Olatunji Ruwase authored Nov 09, 2020
```
* PLD documentation

* Formatting fixes

* Fix url bug
```
  41fb24b3
- PLD documentation (#514) · e351090c
  Olatunji Ruwase authored Nov 09, 2020
```
* PLD documentation

* Formatting fixes
```
  e351090c
05 Nov, 2020 1 commit

Fixing CPU-Adam convergence issue (#503) · 7d4d742b

Reza Yazdani authored Nov 05, 2020

* fixing cpu-adam

* fixing copy with optimizer for data and model parallelism

* fixing cpu-adam

* fix cpu-adam

* fix cpu-adam

7d4d742b

30 Oct, 2020 1 commit
- fixing the AVX_256 compatibility (#497) · 4c37d705
  Reza Yazdani authored Oct 30, 2020
  
  4c37d705