Commits · 7f03282c5170ffca2d1c000776b0334d7fd5c97e · OpenDAS / deepspeed

25 Mar, 2021 1 commit

[debug utils] see_memory_usage fixes (#890) · 7f03282c

Stas Bekman authored Mar 25, 2021

* see_memory_usage fixes

* didn't expect pt-1.2

* fix the order of things

* fix the order of things

7f03282c

24 Mar, 2021 1 commit

[doc] pipeline (#888) · 22d5a1f3

Stas Bekman authored Mar 23, 2021

* [doc] pipeline

As @g-karthik flagged in https://github.com/microsoft/DeepSpeed/pull/659#discussion_r600132598, my previous correction PR had one sentence that said the wrong thing. This PR attempts to rectify that.

Thank you!

* tweak

22d5a1f3

18 Mar, 2021 2 commits

[doc] launcher (#868) · 9e9f8cbe

Stas Bekman authored Mar 18, 2021

As discussed in https://github.com/microsoft/DeepSpeed/issues/662, this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` command-line argument in the correct place in the invocation script

Fixes: https://github.com/microsoft/DeepSpeed/issues/662

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

9e9f8cbe

consistent checkpoint filenaming (#865) · 10c0bea6

Stas Bekman authored Mar 18, 2021



* consistent checkpoint filenaming

* backward compatible rename
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

10c0bea6

16 Mar, 2021 7 commits

1-bit Adam v2 (#817) · 68c8481b

Conglong Li authored Mar 16, 2021

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

* NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
* Support for momentum masks for parameters whose gradients are constantly zero during training.
* Bug fixes (e.g., #813).
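
The momentum-mask item can be illustrated with a small sketch. It assumes a simplified 1-bit compression that keeps only the sign of each momentum entry plus one shared scale, so an exact zero cannot be represented. Under that assumption, entries whose gradients are always zero pick up a nonzero value after compression, and a 0/1 mask restores them to zero. The function names `one_bit_compress` and `apply_momentum_mask` are hypothetical and are not DeepSpeed APIs.

```python
def one_bit_compress(momentum):
    """Toy 1-bit compression: one shared scale times the sign of each entry.

    Assumption for illustration: zero is encoded as +scale, because a
    single sign bit has no way to represent an exact zero.
    """
    scale = sum(abs(v) for v in momentum) / len(momentum)
    return [scale if v >= 0 else -scale for v in momentum]


def apply_momentum_mask(momentum, mask):
    """Zero out the entries whose mask value is 0 (hypothetical helper)."""
    return [v * m for v, m in zip(momentum, mask)]


# The last two entries belong to parameters with constant zero gradients.
momentum = [0.4, -0.2, 0.0, 0.0]
mask = [1, 1, 0, 0]

compressed = one_bit_compress(momentum)
masked = apply_momentum_mask(compressed, mask)

print(compressed)  # the zero-gradient entries are no longer zero
print(masked)      # the mask sets them back to zero
```
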

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm. phase 2. nccl comm. passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missing file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoint handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

68c8481b

bump version 0.3.13 · 12a53b43
Jeff Rasley authored Mar 16, 2021

12a53b43
Make config objects json serializable (#862) · 7bcd72a2
Olatunji Ruwase authored Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
7bcd72a2
Fix ZeRO3 save_checkpoint (#857) · fa87a73a
Olatunji Ruwase authored Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
fa87a73a
Allow args to be optional in deepspeed.initialize (#825) · 871f3048
Jeff Rasley authored Mar 16, 2021

871f3048
docs: minor spelling tweaks (#858) · 547d1c5f
brett koonce authored Mar 16, 2021

547d1c5f
[runner/launch] propagate the error (#854) · 24335d49
Stas Bekman authored Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
24335d49

15 Mar, 2021 2 commits

ZeRO Stage 2: Clear reduced gradients (#856) · a75d971b

Olatunji Ruwase authored Mar 15, 2021



* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

a75d971b

Samyamr/inference hook fix (#851) · 46018859

Samyam Rajbhandari authored Mar 15, 2021



* Fix mis-aligned-grad

When a parameter's size is not divisible by the world size, the partitioned gradients are misaligned due to incorrect padding handling. This PR should fix that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to not be divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

46018859
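
The padding issue described in this commit can be sketched as follows. This is a minimal illustration of splitting a flat tensor of `numel` elements across `world_size` ranks when the size does not divide evenly; it is not the code from stage3.py, and the helper name `partition_bounds` is hypothetical. The point is that only the trailing partitions carry padding, so offsets into the unpadded tensor have to be computed from the real element counts.

```python
def partition_bounds(numel, world_size):
    """Return (start, end, padding) per rank for a flat tensor.

    Every rank owns a slot of ceil(numel / world_size) elements.
    [start, end) indexes the real (unpadded) elements, and `padding`
    is how many slots in that rank's partition hold no real data.
    """
    partition_size = -(-numel // world_size)  # ceiling division
    bounds = []
    for rank in range(world_size):
        start = min(rank * partition_size, numel)
        end = min(start + partition_size, numel)
        padding = partition_size - (end - start)
        bounds.append((start, end, padding))
    return bounds


# 10 elements over 4 ranks: partitions of 3, and the last rank pads 2 slots.
for rank, (start, end, padding) in enumerate(partition_bounds(10, 4)):
    print(rank, start, end, padding)
```
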

14 Mar, 2021 1 commit
- [doc] pipeline doc typos/improvements (#659) · 73d762c8
  Stas Bekman authored Mar 14, 2021
```
Admin merging for pure-doc PR that does not trigger build.
```
  73d762c8
12 Mar, 2021 3 commits
- Bug fix: Remove client optimizer param_group list item that does not have 'params' (#827) · 458ff028
  Cheng Li authored Mar 12, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  458ff028
- bump to v0.3.12 · 35fd7ccd
  Jeff Rasley authored Mar 12, 2021
  
  35fd7ccd
- [WarmupDecayLR] fix log(0) & 1/log(1) bugs (#772) · 18a26f3f
  Stas Bekman authored Mar 11, 2021
```
* fix log(0) & 1/log(1) bugs

* simplify
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
```
  18a26f3f
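
The two failure modes in the title can be reproduced with a small sketch. It assumes a log-shaped warmup factor of the form log(step + 1) / log(warmup_num_steps); that form is an assumption made here for illustration, not something this log states, and `warmup_gamma` is a hypothetical name rather than the scheduler's method. With that form, a step of -1 evaluates log(0), and a warmup length of 1 divides by log(1) = 0. One way to guard against both:

```python
import math


def warmup_gamma(step, warmup_num_steps):
    """Log-shaped warmup factor in [0, 1] (illustrative sketch).

    Guards:
      * warmup_num_steps is clamped to >= 2, so log(warmup_num_steps) != 0.
      * step is clamped to >= 0, so log(step + 1) never sees 0.
    """
    warmup_num_steps = max(2, warmup_num_steps)
    step = max(0, step)
    if step >= warmup_num_steps - 1:
        return 1.0  # warmup finished
    return math.log(step + 1) / math.log(warmup_num_steps)


# Neither call raises, even with the degenerate inputs from the title.
print(warmup_gamma(-1, 1))
print(warmup_gamma(4, 10))
```
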
11 Mar, 2021 5 commits
- Control ZeRO wall clock timers (#849) · 311795d0
  Olatunji Ruwase authored Mar 11, 2021
```
* Control ZeRO wall clock timers

* Disable more ZeRO3 debug prints
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  311795d0
- small tweaks (#839) · 7925d0c3
  Stas Bekman authored Mar 11, 2021
  
  7925d0c3
- Add optimizers and schedules to RTD and updated the corresponding part in the website (#799) · e0f36ed5
  Cheng Li authored Mar 11, 2021
```
* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
```
  e0f36ed5
- less scary overflow notice (#833) · 29853c3e
  Stas Bekman authored Mar 10, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  29853c3e
- set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (#844) · dd03cff2
  Jeff Rasley authored Mar 10, 2021
  
  dd03cff2
10 Mar, 2021 1 commit
- bumping DSE pointer (#847) · 564eb4bd
  Shaden Smith authored Mar 10, 2021
  
  564eb4bd
09 Mar, 2021 2 commits
- Fix regression in runner (#843) · 2e6692c8
  Jeff Rasley authored Mar 09, 2021
  
  2e6692c8
- replace home env with ~ · 49496364
  Jeff Rasley authored Mar 09, 2021
  
  49496364
08 Mar, 2021 7 commits

Model scale changing 5x to 3x · 6adc19a6
Samyam Rajbhandari authored Mar 08, 2021

6adc19a6
Fix for RTD · af548971
Jeff Rasley authored Mar 08, 2021

af548971
bump DSE to include ZeRO-3 · 9c5eee3d
Jeff Rasley authored Mar 08, 2021

9c5eee3d
Fix zero3 tutorial link · 75ffdaf7
Jeff Rasley authored Mar 08, 2021

75ffdaf7
update tutorial/doc links for zero3 (#835) · d7de9165
Jeff Rasley authored Mar 08, 2021

d7de9165

ZeRO 3 Offload (#834) · 599258f9

Samyam Rajbhandari authored Mar 08, 2021



* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic f...

599258f9

Update ZeRO-Offload tutorials (#824) · ba33e86e
Olatunji Ruwase authored Mar 08, 2021

ba33e86e

03 Mar, 2021 1 commit

Fixing gelu_checkpointing memory issue (#812) · 8295d7a8

Reza Yazdani authored Mar 03, 2021

* fixing buffers in transformer kernel when gelu-checkpoint is enabled

* fixing the test issue for other memory optimization flags

* fixing a bug for when attn_dropout_checkpoint is enabled

8295d7a8

28 Feb, 2021 1 commit

issue with the implementation of column_sum_reduce (#804) · 937c5cee

zmx authored Mar 01, 2021

Hi, I took a look at the code of column_sum_reduce and have two questions:
   1. The goal of column_sum_reduce is to compute the column sums of the inp matrix with shape [rows, width], so the result should have shape [width], right? The condition that checks pos does not seem suitable for that.
   2. The CUDA kernel is implemented on the assumption that threads with the same threadIdx.y are grouped into one thread_block_tile, with a blockDim of (32, 32). According to the NVIDIA document https://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf, a thread block tile is a subset of the threads of a thread block, divided into tiles in row-major order. Doesn't that mean threads with the same threadIdx.x are grouped into one thread_block_tile?

Thanks!
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

937c5cee
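
Both questions can be checked with a small reference sketch in Python; it models the indexing only and is not the kernel itself. `column_sum` is a plain reference for reducing a [rows, width] matrix to [width]. `tile_of` applies the CUDA rule that threads are linearized with threadIdx.x varying fastest (rank = x + y * blockDim.x) and that a tile of size 32 takes 32 consecutive ranks. Both function names are hypothetical. With blockDim = (32, 32), this puts threads that share a threadIdx.y into the same tile.

```python
def column_sum(inp):
    """Reference column sum: [rows][width] -> [width]."""
    width = len(inp[0])
    out = [0] * width
    for row in inp:
        for col in range(width):
            out[col] += row[col]
    return out


def tile_of(thread_x, thread_y, block_dim_x=32, tile_size=32):
    """Index of the tile that thread (x, y) falls into.

    Threads are ranked with x varying fastest, then split into
    consecutive groups of tile_size.
    """
    linear_rank = thread_x + thread_y * block_dim_x
    return linear_rank // tile_size


print(column_sum([[1, 2, 3], [4, 5, 6]]))
# Same threadIdx.y, different threadIdx.x: same tile.
print(tile_of(0, 3), tile_of(31, 3))
# Same threadIdx.x, different threadIdx.y: different tiles.
print(tile_of(5, 0), tile_of(5, 1))
```
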

27 Feb, 2021 1 commit
- fixed typo (#802) · db987cf1
  vfdev authored Feb 27, 2021
  
  db987cf1
26 Feb, 2021 3 commits
- document the requirement to call for all ranks (#801) · 7eb083c2
  Stas Bekman authored Feb 26, 2021
  
  7eb083c2
- fixing the compiling issue for the AMD architecture (#796) · 490e6f7c
  Reza Yazdani authored Feb 26, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  490e6f7c
- Delete out2 (#798) · 62396b71
  vfdev authored Feb 26, 2021
  
  62396b71
24 Feb, 2021 2 commits
- Fix the bias-add and add the layer-norm-eps parameter (#791) · e2dfcadf
  Reza Yazdani authored Feb 24, 2021
```
* fix the bias-add precision and indexing, and add layer-norm-eps as a configurable parameter for the transformer

* add ACC_HALF config

* use defined to check if ACC_HALF is defined
```
  e2dfcadf
- Fixing the module-inject Api (#786) · 48065c06
  Reza Yazdani authored Feb 24, 2021
  
  48065c06