Commits · b8be1bc7b663294a121194e51aeebad40c31d60e · OpenDAS / apex

17 Apr, 2021 1 commit

initial cublaslt support for MLP (#1080) · b8be1bc7

Burc Eryilmaz authored Apr 16, 2021



* initial cublaslt support

* 64 bit input

* add license headers

* cleanup

* remove license
Co-authored-by: pbialecki <pbialecki@nvidia.com>

b8be1bc7

15 Apr, 2021 1 commit

Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac

Sudhakar Singh authored Apr 15, 2021

* Add unit tests for fused-novograd

* Fix: tensors should reside on the same device

* Fix: the CUDA stream should be obtained on the same device on which the tensors reside. Found this while debugging the fused NovoGrad multi-device unit test

* Fix issues raised in the review comments

59d2f7ac

19 Oct, 2020 1 commit

Optimize the sync batchnorm by batching the communication (#980) · 8a1ed9e8

lly-zero-one authored Oct 19, 2020

In this PR, we mainly optimize the performance of SyncBatchNorm and also fix a potential issue in the welford_parallel kernel implementation.

For performance, we batch the mean/var/count all_gather communication together and send it once in the forward path.
We also batch the all_reduce in the backward path.
We add a contiguous() call on the input of the welford_parallel kernel.
If there is any standard perf benchmark, I would be happy to run it.

8a1ed9e8
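The forward-path batching described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not apex's implementation: `pack`, `unpack`, `all_gather`, and `merge` are hypothetical helpers, `all_gather` simply hands every rank's buffer to every rank, and the merge uses the standard count-weighted parallel combination of per-rank mean and biased variance. Packing the three statistics into one buffer is what turns three collectives into one, and because each rank's count travels with its statistics, the merge also works when ranks have different batch sizes.

```python
from typing import List, Tuple

Stats = Tuple[List[float], List[float], int]  # (mean per channel, biased var per channel, count)


def pack(mean: List[float], var: List[float], count: int) -> List[float]:
    # One flat buffer per rank: [mean_0..mean_{C-1}, var_0..var_{C-1}, count]
    return mean + var + [float(count)]


def unpack(buf: List[float], channels: int) -> Stats:
    mean = buf[:channels]
    var = buf[channels:2 * channels]
    count = int(buf[2 * channels])
    return mean, var, count


def all_gather(buffers: List[List[float]]) -> List[List[float]]:
    # Hypothetical stand-in for a single collective: every rank gets all buffers.
    return [list(b) for b in buffers]


def merge(stats: List[Stats], channels: int) -> Stats:
    # Combine per-rank statistics into global per-channel mean and biased variance.
    total = sum(c for _, _, c in stats)
    g_mean: List[float] = []
    g_var: List[float] = []
    for ch in range(channels):
        m = sum(mean[ch] * c for mean, _, c in stats) / total
        # Each rank contributes its own variance plus the spread of its mean
        # around the global mean, weighted by its sample count.
        v = sum(c * (var[ch] + (mean[ch] - m) ** 2) for mean, var, c in stats) / total
        g_mean.append(m)
        g_var.append(v)
    return g_mean, g_var, total
```

For example, a rank holding `[1, 2, 3]` (mean 2, variance 2/3) and a rank holding `[4, 5]` (mean 4.5, variance 0.25) merge to mean 3 and variance 2, the statistics of `[1, 2, 3, 4, 5]`.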

05 Aug, 2020 1 commit

set device guard for multi tensor optimizer implementations (#927) · 274cc063

ngimel authored Aug 05, 2020

* add device guards to the optimizers

* add untracked file

* set deviceGuard in multi_tensor_apply

* address review comments; fix lamb

* indent

* typo

274cc063
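The device-guard idea behind these commits, and the multi-device NovoGrad fix above, can be modeled with a small context manager. This is a hypothetical, CPU-only sketch: `_current_device`, `device_guard`, `FakeTensor`, and `multi_tensor_scale` are illustrative stand-ins, not apex or PyTorch APIs. In real PyTorch code the same role is played by the `torch.cuda.device` context manager in Python or `c10::cuda::OptionalCUDAGuard` in C++ extensions.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List

_current_device = 0  # hypothetical process-wide "current CUDA device"


def current_device() -> int:
    return _current_device


def set_device(idx: int) -> None:
    global _current_device
    _current_device = idx


@contextmanager
def device_guard(idx: int) -> Iterator[None]:
    # Switch to idx for the duration of the block and always restore the
    # previous device, even if the body raises.
    prev = current_device()
    set_device(idx)
    try:
        yield
    finally:
        set_device(prev)


@dataclass
class FakeTensor:
    device: int
    data: List[float]


def multi_tensor_scale(tensors: List[FakeTensor], scale: float) -> int:
    # All tensors handled by one launch must live on the same device.
    dev = tensors[0].device
    if any(t.device != dev for t in tensors):
        raise ValueError("tensors must reside on the same device")
    with device_guard(dev):
        # In CUDA code, the stream and kernel launch would be taken here,
        # on the tensors' device rather than on whatever device was current.
        launched_on = current_device()
        for t in tensors:
            t.data = [x * scale for x in t.data]
    return launched_on
```

The guard restores the caller's device on exit, so code that runs after the optimizer step is not silently moved to another GPU.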

06 Jul, 2020 1 commit

[sync BN] (#792) · 1ff54b8f

jjsjann123 authored Jul 06, 2020

* [sync BN]

Support non-uniform batch sizes across the process group.

TODO: tests should be added once the code is cleaned up.

* updating unit tests

* new unit tests for different inputs

* cleaning

1ff54b8f

23 May, 2020 1 commit
- fix function signature · 2be773d3
  Kexin Yu authored May 23, 2020
  
  2be773d3
22 May, 2020 5 commits
- more fixes on dtypes · cf918ac1
  Kexin Yu authored May 22, 2020
  
  cf918ac1
- use pointer · 06a83ce7
  Kexin Yu authored May 22, 2020
  
  06a83ce7
- .data<...>() · 3a727a01
  Kexin Yu authored May 21, 2020
  
  3a727a01
- at::Tensor::data_ptr() · 2c3f3d9a
  Kexin Yu authored May 21, 2020
  
  2c3f3d9a
- fix dtype · abc991da
  Kexin Yu authored May 21, 2020
  
  abc991da
21 May, 2020 1 commit
- make fused LAMB async · f54cc1c9
  Kexin Yu authored May 21, 2020
  
  f54cc1c9
14 May, 2020 1 commit
- Add FusedAdagrad (#822) · 3bae8c83
  Andrew Tulloch authored May 14, 2020
  
  3bae8c83
30 Apr, 2020 3 commits

fix function signature for LAMBStage2Functor · c8bcfff8
Kexin Yu authored Apr 30, 2020

c8bcfff8
enable wider load/store for multi_tensor_apply kernels (#763) · 17ee854e
Deyu Fu authored Apr 30, 2020
* modify MTA axpby for wider load/store

* Make scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads
17ee854e
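The wider load/store change implies a dispatch: take a vectorized path only when the data is suitably aligned and sized, otherwise fall back to scalar accesses. In this plain-Python sketch, `ILP`, `can_use_wide`, and the `offset` parameter (standing in for pointer alignment) are assumptions for illustration; the real kernels work on raw device pointers, where a misaligned wide access is an error rather than a slowdown.

```python
from typing import List

ILP = 4  # hypothetical elements per wide load (e.g. 4 x fp32 = 128 bits)


def can_use_wide(offset: int, n: int, ilp: int = ILP) -> bool:
    # Wide accesses need the start aligned to the vector width and a length
    # that is a multiple of it, so no vector straddles the end of the data.
    return offset % ilp == 0 and n % ilp == 0


def axpby(a: float, x: List[float], b: float, y: List[float], offset: int = 0) -> List[float]:
    # out = a * x + b * y, computed with "vector" or scalar accesses.
    n = len(x)
    out = [0.0] * n
    if can_use_wide(offset, n):
        # Vector path: load ILP elements at once, compute, store ILP at once.
        for i in range(0, n, ILP):
            xs = x[i:i + ILP]
            ys = y[i:i + ILP]
            out[i:i + ILP] = [a * xv + b * yv for xv, yv in zip(xs, ys)]
    else:
        # Scalar fallback: one element per load and store.
        for i in range(n):
            out[i] = a * x[i] + b * y[i]
    return out
```

Both paths compute the same values; only the memory-access granularity differs, which is why the fallback is needed for correctness and the wide path for bandwidth.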

Improvements to apex.mlp (#804) · 31aceeaa

Deyu Fu authored Apr 30, 2020

* update fused bias relu backward kernel

* add support for not requiring the first layer's dgrad

* fix bug: wrong layer in requires grad

* add infrastructure for optional bias and activation; currently only supports no bias and no ReLU

* make bias and relu optional separately

* add sigmoid activation option

31aceeaa
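The optional bias and activation options can be pictured as a forward pass whose per-layer steps are switched by flags. This is a hypothetical pure-Python sketch, not the apex.mlp interface: `linear`, `activate`, `mlp_forward`, the activation names, and the choice to apply the activation after every layer are all assumptions.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]  # [out_features][in_features]


def linear(x: List[float], w: Matrix, b: Optional[List[float]]) -> List[float]:
    y = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    if b is not None:  # bias is optional
        y = [yi + bi for yi, bi in zip(y, b)]
    return y


def activate(y: List[float], activation: str) -> List[float]:
    if activation == "none":
        return y
    if activation == "relu":
        return [max(0.0, v) for v in y]
    if activation == "sigmoid":
        return [1.0 / (1.0 + math.exp(-v)) for v in y]
    raise ValueError(f"unknown activation: {activation}")


def mlp_forward(x: List[float], weights: List[Matrix],
                biases: Optional[List[List[float]]] = None,
                activation: str = "relu") -> List[float]:
    # biases=None selects the bias-free variant; the activation is applied
    # after every layer (an assumption of this sketch).
    for i, w in enumerate(weights):
        b = None if biases is None else biases[i]
        x = activate(linear(x, w, b), activation)
    return x
```

Keeping bias and activation as independent switches matches the "make bias and relu optional separately" item: each combination is a different fused kernel path, but the math composes the same way.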

28 Apr, 2020 1 commit
- LAMB: global grad clipping & more flexibility in adaptive lr · 5b300119
  Kexin Yu authored Apr 28, 2020
  
  5b300119
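Global gradient clipping and LAMB's layer-wise adaptive learning rate can be sketched as two small computations. The helper names are hypothetical, clipping uses one common formulation that divides every gradient by `max(1, ||g|| / max_grad_norm)` where `||g||` is the norm over all tensors together, and the trust ratio is `||w|| / ||update||` with a fallback to 1 when either norm is zero. This illustrates the math only; the commit title does not say exactly how apex exposes the extra flexibility.

```python
import math
from typing import List


def global_norm(grads: List[List[float]]) -> float:
    # L2 norm over all gradient tensors taken together.
    return math.sqrt(sum(g * g for t in grads for g in t))


def clip_factor(grads: List[List[float]], max_grad_norm: float) -> float:
    # Divide every gradient by this factor. It exceeds 1 only when the global
    # norm is above max_grad_norm; a non-positive limit disables clipping.
    if max_grad_norm <= 0:
        return 1.0
    return max(1.0, global_norm(grads) / max_grad_norm)


def trust_ratio(param: List[float], update: List[float]) -> float:
    # Layer-wise scaling ||w|| / ||update||, falling back to 1.0 when
    # either norm is zero.
    w = math.sqrt(sum(p * p for p in param))
    u = math.sqrt(sum(v * v for v in update))
    return w / u if w > 0 and u > 0 else 1.0
```

Clipping by the global norm preserves the relative direction of all gradients, whereas the trust ratio then rescales each layer's step independently.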
22 Apr, 2020 1 commit
- initial commit to add Multilayer Perceptron (MLP) extension (#790) · 71511faf
  Deyu Fu authored Apr 22, 2020
  
  71511faf
10 Apr, 2020 1 commit
- Add no-flattening e5m2-allgather option · c7b34549
  Thor Johnsen authored Apr 09, 2020
  
  c7b34549
27 Feb, 2020 1 commit
- NHWC support for multi tensor apply (#732) · de6378f5
  mcarilli authored Feb 26, 2020
  * NHWC support for multi tensor apply

  * compilation fix for version<=1.4
  de6378f5
04 Oct, 2019 1 commit

move previous fused_adam and fp16_optimizer to contrib (#517) · 1904e48d

Deyu Fu authored Oct 04, 2019

* move previous fused_adam and fp16_optimizer to contrib

* make build contrib.fused_adam optional

* change build option name

* remove unnecessary try import

1904e48d

06 Sep, 2019 1 commit

Fix for #456 (#477) · 325f5a0b

mcarilli authored Sep 05, 2019

* Pushing for build tests

* Contrib files

* Removing deprecated checks

325f5a0b

20 Aug, 2019 1 commit
- add back lamb stage1/2 to amp_C python · b9f0995b
  Deyu Fu authored Aug 20, 2019
  
  b9f0995b
17 Aug, 2019 1 commit
- add back legacy lamb code for backward compatibility now · 2bc766ce
  Deyu Fu authored Aug 16, 2019
  
  2bc766ce
16 Aug, 2019 2 commits

clean up variance options supported by all fused optimizers: · 18062b69

Deyu Fu authored Aug 16, 2019

* Correctly skip bias correction for epsilon (same as the recent upstream change)

* Correctly skip bias correction for weight decay (consistent with upstream AdamW)

* Add adam_w_mode to FusedAdam/LAMB to select L2 regularization or decoupled weight decay (Adam vs. AdamW)

* Correctly document reg_inside_moment in FusedNovoGrad as distinct from adam_w_mode

* Remove legacy eps_mode from FusedAdam

* Make the internal math type float across all fused optimizers

18062b69
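The Adam vs. AdamW distinction and the bias-correction rules listed above can be shown on a single scalar parameter. This is a sketch with hypothetical names and default hyperparameters, not FusedAdam's kernel. With `adam_w_mode=False`, weight decay is folded into the gradient (L2 regularization); with `adam_w_mode=True` it is added to the update after the adaptive scaling. In both cases bias correction touches only the moment estimates, not `eps` or the decay term.

```python
import math
from typing import Tuple


def adam_step(p: float, g: float, m: float, v: float, step: int,
              lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8, weight_decay: float = 0.0,
              adam_w_mode: bool = True) -> Tuple[float, float, float]:
    if not adam_w_mode:
        # L2 regularization: decay enters the gradient, so it also flows
        # through the moment estimates and the adaptive denominator.
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction is applied to the moments only; eps is added afterwards.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if adam_w_mode:
        # Decoupled weight decay (AdamW): not bias-corrected and not
        # rescaled by the adaptive denominator.
        update = update + weight_decay * p
    return p - lr * update, m, v
```

With zero weight decay the two modes coincide. With nonzero decay they diverge because L2 decay is rescaled by the adaptive denominator while decoupled decay is not: starting from `p = 1`, `g = 0`, `weight_decay = 0.1`, the first step gives roughly `0.999` in L2 mode and `0.9999` in AdamW mode.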

add fused lamb, put lamb kernels into one file · c8f9cceb
Deyu Fu authored Aug 16, 2019

c8f9cceb

08 Aug, 2019 1 commit
- initial commit to make fused optimizers compatible with AMP · 690b1f71
  Deyu Fu authored Aug 08, 2019
  
  690b1f71
06 Aug, 2019 1 commit

Clean up layer norm tests (#418) · 3ef01fae

ngimel authored Aug 06, 2019

* Bug fix for non-affine layer-norm + add backward unit test

* clean up tests and add tests for a large batch

3ef01fae

01 Aug, 2019 1 commit
- fix fused layer norm for >65535 batch · 4a8e1a87
  Natalia Gimelshein authored Aug 01, 2019
  
  4a8e1a87
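The 65535 figure matches CUDA's limit on `gridDim.y`. One standard remedy, sketched here in Python with hypothetical names, is to cap the grid at that limit and have each block stride over rows, so every row is processed exactly once however large the batch is. Whether this commit used exactly this scheme is an assumption.

```python
from typing import Dict, List

MAX_GRID_Y = 65535  # CUDA limit on gridDim.y


def launch_rows(n_rows: int, max_grid_y: int = MAX_GRID_Y) -> Dict[int, List[int]]:
    # Cap the grid, then let each block walk rows block_y, block_y + grid_y, ...
    # (a grid-stride loop), so rows beyond the cap are still covered.
    grid_y = min(n_rows, max_grid_y)
    return {block_y: list(range(block_y, n_rows, grid_y)) for block_y in range(grid_y)}
```

Without the stride loop, launching one block per row would fail outright once the row count exceeds the limit.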
26 Jul, 2019 1 commit
- Add missing semicolon. (#390) · 3f7f5fba
  Edward Z. Yang authored Jul 12, 2019
  Signed-off-by: Edward Z. Yang <ezyang@fb.com>
  3f7f5fba
12 Jul, 2019 1 commit
- Add missing semicolon. (#390) · 80e0143e
  Edward Z. Yang authored Jul 12, 2019
  Signed-off-by: Edward Z. Yang <ezyang@fb.com>
  80e0143e
03 Jul, 2019 4 commits
- Pulling in deprecation warning changes · 665b2dd7
  Michael Carilli authored Jul 03, 2019
  
  665b2dd7
- Remove deprecated Type.h · 816813f9
  Michael Carilli authored Jul 03, 2019
  
  816813f9
- Remove deprecated Type.h · 7096b1b7
  Michael Carilli authored Jul 03, 2019
  
  7096b1b7
- Changing AT_CHECK to TORCH_CHECK · adee29f6
  Michael Carilli authored Jul 03, 2019
  
  adee29f6
28 Jun, 2019 1 commit
- Add support for fp16 update term (new UPD_T typename in template) · 3aeea0d8
  Thor Johnsen authored Jun 28, 2019
  
  3aeea0d8
14 Jun, 2019 1 commit
- Separate LDG/STG from compute loop (#359) · 121a2500
  Thor Johnsen authored Jun 13, 2019
  
  121a2500
11 Jun, 2019 1 commit
- Allow multi_tensor_lamb to update fp16 params · 47e3367f
  Michael Carilli authored Jun 11, 2019
  
  47e3367f
31 May, 2019 2 commits

Multi tensor lamb optimizer (#334) · 8be5b6be

Thor Johnsen authored May 31, 2019

* First draft, for discussion

* Fix mistakes in LAMB equations

* Add loop over chunk

* Bug fix

* Bug fix

* Bug fix

* Undo bug fix

* Bug fix

* Add multi tensor LAMB optimizer to setup.py

* Rename step_size to learning_rate

* Fix compilation errors

8be5b6be

Give multi-tensor L2 norm the ability to compute norms per-tensor as well as globally (#333) · 93338e62

mcarilli authored May 31, 2019

* Existing tests passing, still need to add per-tensor tests

* Test is passing, still need to measure performance

* ILP for l2norm functor

93338e62
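Computing per-tensor norms alongside the global norm costs almost nothing extra, because the global norm is the square root of the sum of the per-tensor squared norms. The following pure-Python sketch uses a hypothetical `l2norm_sketch` function; it mirrors the idea, not the signature of the CUDA extension.

```python
import math
from typing import List, Optional, Tuple


def l2norm_sketch(tensors: List[List[float]],
                  per_tensor: bool = False) -> Tuple[float, Optional[List[float]]]:
    # One pass accumulates each tensor's sum of squares; the global norm is
    # derived from the same partial sums, so per-tensor norms need no extra pass.
    sq = [sum(x * x for x in t) for t in tensors]
    global_norm = math.sqrt(sum(sq))
    if per_tensor:
        return global_norm, [math.sqrt(s) for s in sq]
    return global_norm, None
```

For example, tensors `[3, 4]`, `[0]`, and `[12]` have per-tensor norms `5`, `0`, and `12`, and a global norm of `sqrt(25 + 0 + 144) = 13`. Per-tensor norms are what layer-wise optimizers such as LAMB need.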