Commits · 9d86158d89aa79aacd600a1b2d9600a1b1ff11da · OpenDAS / apex

01 Sep, 2021 1 commit
- Merge pull request #1148 from azrael417/thorsten-view-fix · 9d86158d
  Kexin Yu authored Aug 31, 2021
```
wrapper function for flat view creation in _lazy_init_stage2
```
  9d86158d
31 Aug, 2021 3 commits
- Merge pull request #1151 from NVIDIA/spatial_fast_bottleneck · ed713c84
  Thor Johnsen authored Aug 31, 2021
```
Spatially Distributed Fast Bottleneck block
```
  ed713c84
- Add module tests · bbc95c0a
  Thor Johnsen authored Aug 31, 2021
  
  bbc95c0a
- First release · 2f164a2a
  Thor Johnsen authored Aug 31, 2021
  
  2f164a2a
30 Aug, 2021 1 commit
- Wrote a small wrapper function for flat view creation in _lazy_init_stage2 to... · 333da806
  Thorsten Kurth authored Aug 30, 2021
```
Wrote a small wrapper function for flat view creation in _lazy_init_stage2 to support channels last data formats
```
  333da806
20 Aug, 2021 1 commit
- include iostream (#1144) · d6b5ae5d
  X Wang authored Aug 20, 2021
  
  d6b5ae5d
17 Jul, 2021 2 commits

Added more fusion and vectorized kernel for transducer (#1125) · 0c2c6eea

Nan Zheng authored Jul 17, 2021

* Added support for fused ReLU and dropout into transducer joint

* Reorganized code selection path in transducer joint fwd
* Added support for fused ReLU+dropout into transducer joint

* Vectorize transducer loss backward with fused softmax (#3)

* Nanz/transducer loss (#4)

* Vectorize transducer loss backward with fused softmax

* Added a predicate to avoid potential IMA

* Nanz/transducer loss (#5)

* Vectorize transducer loss backward with fused softmax

* Added a predicate to avoid potential IMA

* Added more predicates to avoid IMAs

* Updated documentation for newly added features.

* Fixed an error in transducer.py

0c2c6eea

Adds small-batch kernels (#1126) · ed719967
yjk21 authored Jul 17, 2021

ed719967

16 Jul, 2021 1 commit
- local_rank fix (#1129) · c1378e6f
  X Wang authored Jul 16, 2021
```
* local_rank and install cuda version fix
```
  c1378e6f
15 Jun, 2021 2 commits
- Merge pull request #1118 from FDecaYed/deyuf/update_adam_string · d06404fe
  Deyu Fu authored Jun 15, 2021
```
update fusedadam docstring
```
  d06404fe
- update fusedadam docstring · ab520f82
  Deyu Fu authored Jun 14, 2021
  
  ab520f82
26 May, 2021 1 commit

Distributed LAMB: Clip grads before reduce_scatter/all_reduce (#1099) · ebcd7f08

Kexin Yu authored May 25, 2021

* clip before reduce scatter

* provide clip before/after RS option

* change to clip after ar (avoid confusion)

* fix comments

ebcd7f08

17 May, 2021 1 commit
- compile cublasLt code only for cublas >= 11.0 (#1108) · 00c1e56d
  Burc Eryilmaz authored May 17, 2021
```
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
  00c1e56d
19 Apr, 2021 1 commit
- Fix cublasLt context create/destroy overhead in MLP extension (#1083) · 082f999a
  Burc Eryilmaz authored Apr 19, 2021
```
* don't create cublasLt handle, fix zero block size case

* cleanup
```
  082f999a
17 Apr, 2021 3 commits

initial cublaslt support for MLP (#1080) · b8be1bc7

Burc Eryilmaz authored Apr 16, 2021



* initial cublaslt support

* 64 bit input

* add license headers

* cleanup

* remove license
Co-authored-by: pbialecki <pbialecki@nvidia.com>

b8be1bc7

remove MIT license (#1081) · b5eb38db
ptrblck authored Apr 16, 2021

b5eb38db

Adding fast bottleneck implementation into contrib (#1079) · 705cba91

Deyu Fu authored Apr 17, 2021



* initial commit for adding fast bottleneck

* sync cudnn-frontend module
Co-authored-by: pbialecki <pbialecki@nvidia.com>

705cba91

16 Apr, 2021 1 commit
- adds fmhalib (#1074) · 5c9b21d8
  yjk21 authored Apr 16, 2021
  
  5c9b21d8
15 Apr, 2021 3 commits

Update README.md (#1067) · e5f2f675
Jay Rodge authored Apr 15, 2021
```
Fixed a typo
```
e5f2f675

DistributedFusedLAMB: enable no_copy and add barrier for SHARP (#1075) · bb791585

Kexin Yu authored Apr 15, 2021



* enable no_copy

* barrier for SHARP

* set verbose=False by default
Co-authored-by: Kexin Yu <kexiny@nvidia.com>

bb791585

Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac

Sudhakar Singh authored Apr 15, 2021

* Add unit tests for fused-novograd

* Fix: tensors should reside on the same device

* Fix: Cudastream should be called on the same device on which the tensors reside. Found this while debugging the fused novograd multi-device unit test

* fixed issues mentioned in the comments

59d2f7ac

24 Mar, 2021 2 commits

sync-free Distributed LAMB + parameter reordering (#1055) · a651e2c2

Kexin Yu authored Mar 24, 2021



* sync free Distributed LAMB

* init lr with provided value

* wait l2 norm stream

* reorder param

* fix indent
Co-authored-by: Kexin Yu <kexiny@nvidia.com>

a651e2c2

Initial check-in of the transducer extensions (#1069) · d86d1b09

Nan Zheng authored Mar 23, 2021

* Initial check-in of the transducer extension.

* Added more comments to help explain the code

* Corrected minor typos

* Renamed variable in tests to match the extension

* Disabled ninja build option

d86d1b09

23 Feb, 2021 1 commit
- fast layer norm (#1037) · e2083df5
  yjk21 authored Feb 23, 2021
  
  e2083df5
10 Feb, 2021 1 commit

fix import container_abcs issue (#1049) · a78ccf0b

Shoufa Chen authored Feb 10, 2021

* copy-paste friendly

* fix import container_abcs issue

Nightly PyTorch has removed `container_abcs` from `torch._six`.  https://github.com/pytorch/pytorch/commit/58eb23378f2a376565a66ac32c93a316c45b6131#diff-b3c160475f0fbe8ad50310f92d3534172ba98203387a962b7dc8f4a23b15cf4dL35

* keep existing for pytorch1.7 and earlier

a78ccf0b
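
The compatibility pattern described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: it assumes, as the message states, that older PyTorch builds expose `container_abcs` under `torch._six`, and it falls back to the standard library's `collections.abc` otherwise. The helper `is_mapping` is a hypothetical name used only for the demonstration.

```python
# Prefer the legacy location for older PyTorch (1.7 and earlier, per the
# commit message); fall back to the standard library when that name is gone
# or when PyTorch is not installed at all.
try:
    from torch._six import container_abcs  # legacy path, removed in nightly builds
except ImportError:
    import collections.abc as container_abcs


def is_mapping(obj):
    """Hypothetical helper: True if obj behaves like a dict-style container."""
    return isinstance(obj, container_abcs.Mapping)


print(is_mapping({"lr": 1e-3}))  # True
print(is_mapping([1, 2, 3]))     # False
```

A `try`/`except ImportError` is used here instead of a version check so the sketch also runs where PyTorch is absent; the actual change may well have chosen a different mechanism.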

20 Jan, 2021 1 commit
- cuda rng changes for graph capture with apex MHA (#1025) · eefb1ba2
  Burc Eryilmaz authored Jan 20, 2021
```
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
  eefb1ba2
17 Dec, 2020 2 commits

Merge pull request #1015 from jpool-nv/patch-1 · 154c6336
Thor Johnsen authored Dec 17, 2020
```
Update ASP README to highlight default recipe
```
154c6336

Update ASP README to highlight default recipe · 56914d4f

jpool-nv authored Dec 17, 2020

The recipe was presented after some non-standard API calls, so this change moves the suggested usage up, gives it its own section, and reinforces the suggested usage in the non-standard section.

56914d4f

04 Dec, 2020 3 commits
- remove pip-version-check noise that hides the outcome of the build (#998) · 8cf5ae61
  Stas Bekman authored Dec 04, 2020
  
  8cf5ae61
- Distributed LAMB fixes (#1007) · 8a80d478
  Kexin Yu authored Dec 03, 2020
```
* add flag for DistributedAdam: step_support_amp_scaling
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
```
  8a80d478
- Seryilmaz/fused dropout softmax (#985) · 3fe10b55
  Burc Eryilmaz authored Dec 03, 2020
```
* fuse dropout into softmax in fprop for additive mask case
```
  3fe10b55
02 Dec, 2020 1 commit

Fix lack of proper loading of best_prec1 from the checkpoint (#1000) · 6c186b3b

Janusz Lisiecki authored Dec 02, 2020



- resume() is a nested function and when it loads best_prec1
  it creates a local variable that hides the one from the parent
  function (which refers to the global one). This PR adds `global`
  to modify the global variable as intended
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

6c186b3b
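
The scoping issue described above can be reproduced in a few lines. This is a simplified sketch, not the example script itself: the names `main_buggy`, `main_fixed`, and the checkpoint dictionary are hypothetical, and only the `best_prec1` / `resume()` naming comes from the commit message.

```python
best_prec1 = 0.0  # module-level value that the training script tracks


def main_buggy(checkpoint):
    global best_prec1

    def resume():
        # Assignment without a declaration creates a new local name that
        # shadows the module-level variable; the global is left untouched.
        best_prec1 = checkpoint["best_prec1"]
        return best_prec1

    resume()
    return best_prec1  # still the old global value


def main_fixed(checkpoint):
    global best_prec1

    def resume():
        global best_prec1  # the nested function now writes the global
        best_prec1 = checkpoint["best_prec1"]

    resume()
    return best_prec1


ckpt = {"best_prec1": 76.1}  # hypothetical checkpoint contents
print(main_buggy(ckpt))  # 0.0
print(main_fixed(ckpt))  # 76.1
```

Because the enclosing function declares the name `global` rather than owning it, `nonlocal` is not an option in the nested function, which is presumably why the fix uses `global` there as well.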

01 Dec, 2020 1 commit

DistributedFusedAdam Model Parallelism Support (Megatron) (#981) · 6b7e77b0

Kexin Yu authored Dec 01, 2020



DistributedFusedAdam Model Parallelism Support (Megatron)
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>

6b7e77b0

19 Oct, 2020 1 commit

Optimize the sync batchnorm by batching the communication (#980) · 8a1ed9e8

lly-zero-one authored Oct 19, 2020

In this PR, we mainly optimize the performance of SyncBatchNorm and also fix one potential issue in the welford_parallel kernel implementation.

* For the performance improvement, we batch the mean/var/count all_gather communication together and send it once in the forward path.
* We also batch the all_reduce in the backward path.
* We add a contiguous call on the input of the welford_parallel kernel.

If there is any standard perf benchmark, I would be happy to run it.

8a1ed9e8
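
The idea of sending mean, variance, and count as one batched message can be illustrated without any GPU code. The sketch below is not the apex kernel: `local_stats` and `merge_stats` are hypothetical names, the all_gather is simulated with a plain Python list, and the merge uses the standard parallel-variance formula that a Welford-style reduction relies on.

```python
from statistics import fmean, pvariance


def local_stats(values):
    """Per-rank statistics packed into one record: (mean, biased var, count)."""
    return (fmean(values), pvariance(values), len(values))


def merge_stats(records):
    """Combine per-rank (mean, var, count) records into global mean and variance."""
    total = sum(n for _, _, n in records)
    mean = sum(m * n for m, _, n in records) / total
    # Sum of squared deviations about the global mean, accumulated rank by rank.
    m2 = sum(n * (v + (m - mean) ** 2) for m, v, n in records)
    return mean, m2 / total


# Simulated "all_gather": one packed record per rank, collected in a single step
# rather than three separate exchanges for mean, var, and count.
rank_data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
gathered = [local_stats(x) for x in rank_data]

mean, var = merge_stats(gathered)
print(mean, var)
```

The merged result should agree, up to floating-point rounding, with the mean and biased variance computed over all samples at once, which is what makes a single packed exchange sufficient.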

29 Sep, 2020 1 commit
- use reshape instead of view (#971) · a109f856
  ptrblck authored Sep 28, 2020
  
  a109f856
15 Sep, 2020 1 commit
- Merge pull request #959 from a-maci/update-ASP-readme · 4a1fa2c4
  Thor Johnsen authored Sep 15, 2020
```
Update asp readme
```
  4a1fa2c4
14 Sep, 2020 2 commits
- Merge pull request #5 from a-maci/a-maci-patch-update-asp-readme · 48fc613d
  Asit authored Sep 14, 2020
```
Update README for ASP
```
  48fc613d
- Update README for ASP · e3794f42
  Asit authored Sep 14, 2020
```
Added an outline to illustrate our recommended recipe to obtain a pruned model
```
  e3794f42
15 Aug, 2020 1 commit
- Should pass stricter stride/size checks in pytorch (#942) · 4ef930c1
  mcarilli authored Aug 14, 2020
  
  4ef930c1
10 Aug, 2020 1 commit
- move sm80 code inside MHA (#937) · 5d9b5cbc
  ptrblck authored Aug 10, 2020
```
Co-authored-by: pbialecki <pbialecki@nvidia.com>
```
  5d9b5cbc