Commits · 376818ef9df2d9d8932beb7adde10f89b3653570 · OpenDAS / deepspeed

15 Jul, 2020 2 commits
- Empty grad fix (#291) · 376818ef
  Jeff Rasley authored Jul 15, 2020
```
* empty grad fix
* add unit tests for empty grad
```
- Fix bug in fp32 optimizer state loading (#289) · 607814fe
  Olatunji Ruwase authored Jul 15, 2020
  
14 Jul, 2020 1 commit

Support loading and saving ZeRO checkpoints with changing DP degree (#240) · 7ccc9daf

Olatunji Ruwase authored Jul 14, 2020



* Support saving and loading ZeRO checkpoints with a different data parallelism degree.

* Fix formatting

* Support checkpoint with varying GPU count in ZeRO stage 1

* Fix formatting

* Formatting fixes

* Update model tests

* Remove pprint

* Minor fix

* Fix formatting

* Update model tests
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

11 Jul, 2020 1 commit
- Support amp deepspeed backend (#286) · f5453124
  Jeff Rasley authored Jul 11, 2020
```
* add amp support for deepspeed (non-ZeRO)
* tests for amp mode
```
06 Jul, 2020 1 commit

ZeRO-2: Handle gradients of empty partitions (#275) · 4a3234e0

Olatunji Ruwase authored Jul 06, 2020



* Load non-DeepSpeed checkpoints into ZeRO optimizer

* Handle parameters smaller than DP

* Formatting fixes

* Handle empty partitions

* Fix perf bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

23 Jun, 2020 1 commit

Handle parameter groups smaller than DP (#273) · 88c319aa

Olatunji Ruwase authored Jun 23, 2020

* Load non-DeepSpeed checkpoints into ZeRO optimizer

* Handle parameters smaller than DP

* Formatting fixes
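
A parameter group with fewer elements than data-parallel (DP) ranks leaves some ranks with nothing to own. The sketch below shows one generic way to compute per-rank slices so that those ranks get an empty range instead of an out-of-bounds one. It is an illustration only, not the code from this commit; the function name `partition_bounds` and its parameters are hypothetical.

```python
def partition_bounds(num_elements, dp_size):
    """Return one (start, end) slice per DP rank over a flat buffer.

    Hypothetical helper: uses ceiling division for the partition size and
    clamps every bound to the buffer length, so trailing ranks may receive
    an empty slice (start == end) rather than an invalid index.
    """
    part = -(-num_elements // dp_size)  # ceiling division
    bounds = []
    for rank in range(dp_size):
        start = min(rank * part, num_elements)
        end = min(start + part, num_elements)
        bounds.append((start, end))
    return bounds


# Three elements spread over four ranks: the last rank owns nothing.
print(partition_bounds(3, 4))
```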

30 May, 2020 1 commit
- update tests · bbd8cd7d
  Jeff Rasley authored May 29, 2020
  
29 May, 2020 1 commit

Transformer kernel release (#242) · 734d8991

Jeff Rasley authored May 29, 2020



* Transformer kernels release
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

27 May, 2020 1 commit
- Support fp32 grad clipping and fix max_grad_norm confusion (#232) · abe2204d
  Jeff Rasley authored May 26, 2020
```
* updates to support fp32 grad clipping and disable max_grad_norm
```
20 May, 2020 1 commit
- reduce size of megatron tests (#223) · 53ac7947
  Jeff Rasley authored May 20, 2020
  
19 May, 2020 1 commit

ZeRO-2 (#217) · f2ac7eaf

Jeff Rasley authored May 19, 2020



Updates for ZeRO stage 2 + ZeRO stage 1 with reduce-scatter (RS)
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>

18 May, 2020 1 commit

adding BingSqaud e2e test (#214) · c61e23b4

Arash Ashari authored May 18, 2020

* adding BingSqaud e2e test

* updating the draft test; bring final step under try section

* finalizing test for base deepspeed and deepspeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting

11 May, 2020 1 commit
- Support dynamic loss scale args in fp16 optimizers (#212) · 0be026e3
  Olatunji Ruwase authored May 11, 2020
```
* Support dynamic loss scale args in fp16 optimizers

* Update names
```
06 May, 2020 1 commit
- Fix global_steps checkpoint loading. (#139) · b2c87edf
  Shaden Smith authored May 06, 2020
  
30 Apr, 2020 1 commit

Upgrade apex version, turn off legacy fusion (#205) · 3ce531c9

Jeff Rasley authored Apr 30, 2020

* update apex version to feb 5th commit

* use gradient clipping instead of max grad norm in tests

* add warning when user provides max_grad_norm

* update examples commit

24 Apr, 2020 1 commit
- Fix index out of range error when parameter count is not multiple of ranks (#202) · 512a0d4d
  Olatunji Ruwase authored Apr 24, 2020
  
27 Mar, 2020 2 commits

Support multi-output models (#170) · 53c73fe3

Olatunji Ruwase authored Mar 27, 2020

* Push to remote

* Correctly handle multi output models by doing loss scaling in backward()
Unit tests for multi output models

* Fix formatting issues

* Formatting issues fix

* Fix formatting

* Update DeepSpeedExamples submodule
Enable Megatron model tests

Add "zero_allow_untested_optimizer" option in conf file (#173) · 43f27332

Calogero Zarbo authored Mar 27, 2020

* added zero_allow_untested_optimizer flag helpers

* add zero_allow_untested_optimizer config constants

* zero_allow_untested_optimizer logic with assertion

* Added unit test and CustomOptimizer helper class
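
The "logic with assertion" step above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the commit's code: the config key `zero_allow_untested_optimizer` comes from the commit title, while `TESTED_OPTIMIZERS`, `check_zero_optimizer`, and the default of `False` are assumptions made for illustration.

```python
# Config key named in this commit's title.
ZERO_ALLOW_UNTESTED_OPTIMIZER = "zero_allow_untested_optimizer"

# Illustrative placeholder only; not the actual list of optimizers tested with ZeRO.
TESTED_OPTIMIZERS = {"Adam"}


def check_zero_optimizer(config, optimizer_name):
    """Assert that an untested optimizer is used with ZeRO only on explicit opt-in.

    Hypothetical helper: `config` is the parsed configuration dict and the
    flag is assumed to default to False when absent.
    """
    allow_untested = config.get(ZERO_ALLOW_UNTESTED_OPTIMIZER, False)
    if optimizer_name not in TESTED_OPTIMIZERS:
        assert allow_untested, (
            f"{optimizer_name} is untested with ZeRO; set "
            f'"{ZERO_ALLOW_UNTESTED_OPTIMIZER}": true in the config to proceed'
        )
    return True


# A custom optimizer passes only when the flag is set in the config.
check_zero_optimizer({"zero_allow_untested_optimizer": True}, "CustomOptimizer")
```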

25 Mar, 2020 1 commit
- Adding static loss scaling for ZeRO. (#166) · a76572dc
  Shaden Smith authored Mar 25, 2020
  
10 Mar, 2020 2 commits

Enhancement: Ability to load checkpoint without loading the optimizer… (#128) · 936117b5

Samyam Rajbhandari authored Mar 10, 2020

* Enhancement: Ability to load a checkpoint without loading the optimizer states. Unit test covering saving and loading checkpoints with fused, unfused, and ZeRO optimizers. The unit test takes about 165s.

Make lr schedulers support fp16 optimizers (#124) · 1c0b326e

Olatunji Ruwase authored Mar 10, 2020



* add tests cases for onecycle policy with fp16/zero

* Make lr schedulers support fp16 optimizers

* Fix formatting

* More specific naming
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

27 Feb, 2020 1 commit
- add some csr addition unit tests (#110) · ccec2463
  Jeff Rasley authored Feb 27, 2020
  
26 Feb, 2020 1 commit

Init distributed torch only if needed (#108) · 5aa58b38

Jeff Rasley authored Feb 26, 2020

* add auto-detect to torch dist init

* update tests to infer distributed init status

* prevent crash if dist_init_required is True but already initialized

* only init if safe to do so (forgot to add this file in prev commit)
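
The decision these bullets describe can be sketched as a small pure function. This is an assumption-laden illustration, not the commit's implementation: `dist_init_required` is the name used in the commit message, while `should_init_distributed` and `already_initialized` are hypothetical. In practice `already_initialized` would come from something like `torch.distributed.is_initialized()`.

```python
def should_init_distributed(dist_init_required, already_initialized):
    """Decide whether the distributed backend should be initialized now.

    dist_init_required: True, False, or None (None is assumed to mean auto-detect).
    already_initialized: whether a process group already exists.
    """
    if dist_init_required is None:
        # Auto-detect: initialize only if nobody has done so yet.
        return not already_initialized
    if dist_init_required and already_initialized:
        # Requested, but a second init would crash, so skip it.
        return False
    return bool(dist_init_required)


# Auto-detect on a fresh process asks for initialization.
print(should_init_distributed(None, False))
```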

22 Feb, 2020 1 commit

Support legacy optimizer fusion as config option (#75) · 6d602065

Olatunji Ruwase authored Feb 21, 2020

* Support legacy optimizer fusion as config option

* Configure for legacy optimizer fusion

* Update configuration jsons for new apex

20 Feb, 2020 1 commit
- Refactor simple model test, fix pythonpath issue (#96) · 001abe23
  Jeff Rasley authored Feb 20, 2020
```
Also a fix for #94 
```
15 Feb, 2020 1 commit
- Fix issue with empty grads for non-fused optimizers (#83) · 807480a0
  Jeff Rasley authored Feb 14, 2020
```
bug fixes for adamw/lamb and corresponding tests
```
14 Feb, 2020 1 commit

Porting BingBertSquad test (#70) · 37ff62cc

Shaden Smith authored Feb 14, 2020

* Porting BingBertSquad test

* Updating default paths.

* Enable model tests.

* Updating DeepSpeedExamples submodule

* Adding BingBertSquad's log uploads.

* Messed up the submodule again :-)

12 Feb, 2020 1 commit
- remove the undefined variable in ckpt testing (#67) · 4f7d016d
  eltonzheng authored Feb 12, 2020
  
10 Feb, 2020 1 commit
- Moving to major/minor/patch versioning. (#51) · 50ae149f
  Shaden Smith authored Feb 09, 2020
  
07 Feb, 2020 1 commit

Samyamr/batchconfig (#33) · 5a0abc65

Samyam Rajbhandari authored Feb 07, 2020

* simplifying the batch config, using a single assert to test for validity and allowing for specifying only the micro batch size

* Simplifying Batch Config, Adding ability to specify batch using just micro_batch, and adding a bunch of unit tests

* ran formatting

* Typo fixes and added the config file

* reformatting

* path fixes

* removing print statements
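
The idea of deriving the full batch configuration from a partial one and validating it with a single assert can be sketched as below. This is a minimal illustration, not the code from this commit. It assumes the relationship train batch = micro batch per GPU × gradient accumulation steps × data-parallel world size, and that accumulation defaults to 1 when it cannot be derived; `resolve_batch_config` and its parameter names are hypothetical.

```python
def resolve_batch_config(world_size, train_batch=None, micro_batch=None, grad_acc=None):
    """Fill in missing batch settings, then validate them with one assert.

    Assumes world_size is a positive integer.
    """
    if train_batch is None and micro_batch is None:
        raise ValueError("specify at least train_batch or micro_batch")
    if train_batch is None:
        # Only the micro batch (and maybe accumulation steps) was given.
        grad_acc = 1 if grad_acc is None else grad_acc
        train_batch = micro_batch * grad_acc * world_size
    elif micro_batch is None:
        # Derive the micro batch from the global batch.
        grad_acc = 1 if grad_acc is None else grad_acc
        micro_batch = train_batch // (grad_acc * world_size)
    elif grad_acc is None:
        # Derive accumulation steps from the global and micro batch.
        grad_acc = train_batch // (micro_batch * world_size)
    # Single validity check covering every combination of inputs.
    assert (
        min(train_batch, micro_batch, grad_acc) > 0
        and train_batch == micro_batch * grad_acc * world_size
    ), f"inconsistent batch config: {train_batch} != {micro_batch} * {grad_acc} * {world_size}"
    return train_batch, micro_batch, grad_acc


# Specifying only the micro batch size on 4 GPUs.
print(resolve_batch_config(4, micro_batch=8))
```

Integer division is used on purpose: a combination that does not divide evenly produces values that fail the final assert instead of being silently rounded.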

06 Feb, 2020 2 commits
- Improve doc string for add_XXX_arguments (#32) · 8326aff2
  Olatunji Ruwase authored Feb 06, 2020
```
Unit tests for add_XXX_arguments
```
- Handle missing optional configuration fields correctly (#24) · af81f6f5
  Olatunji Ruwase authored Feb 06, 2020
```
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
05 Feb, 2020 1 commit

Enables NCCL backend in @distributed_test (#13) · 438aa017

Shaden Smith authored Feb 05, 2020

* Enables NCCL backend in @distributed_test

* Adds pytest-forked to avoid CUDA re-initialization issue.

* paste typo

* transcription typo

04 Feb, 2020 4 commits
- add version check test (#9) · d6846203
  Jeff Rasley authored Feb 04, 2020
  
- add allreduce test (#7) · 52c5a936
  Jeff Rasley authored Feb 04, 2020
```
* add allreduce test

* comment out set rank to cuda for now

* switched back to gloo
```
- Distributed testing (#6) · b61a2217
  Shaden Smith authored Feb 04, 2020
```
* Adds distributed_test decorator and some unit tests.

* Setting NCCL backend.

* Parametrizes test.

* rank -> local_rank

* Temporarily disable CUDA initialization.
```
- Model tests executable fix · caaae992
  Shaden Smith authored Feb 03, 2020
  
03 Feb, 2020 3 commits
- examples -> DeepSpeedExamples · e9b097f1
  Shaden Smith authored Feb 03, 2020
  
- Update model examples path. · 9a5717e3
  Shaden Smith authored Feb 03, 2020
  
- add test model Megatron_GPT2 · 98f5131b
  Elton Zheng authored Feb 03, 2020
  