- 10 Feb, 2022 1 commit

Masaki Kozuki authored

- 01 Feb, 2022 1 commit

ChongyuNVIDIA authored
* Add the permutation-related support as an extension of the ASP lib.
* [Fix] Track the permutation sequence for the progressive channel swap strategy.
* Fix the corner case where one layer is not sparse but still needs a permutation applied because of its siblings.
* Fix the deprecated functions in the ASP unit tests.
* Fix the sparsity info typo in the ASP lib.
* [Enhancement] Set an identical random seed on all GPUs so that permutation search generates the same results everywhere.
* Update the README.md with the identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into the ASP lib.
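
The ASP workflow these changes extend is driven from Python roughly as below. A minimal sketch, assuming the documented `ASP.prune_trained_model` entry point; the permutation search described above runs inside the mask-computation step, and no keyword controlling it is shown because its exact name is not given here.

```python
import torch
from apex.contrib.sparsity import ASP

# Identical seed on every GPU/rank, as the commit notes, so the permutation
# search arrives at the same result everywhere.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Compute 2:4 sparse masks for a trained model. Siblings of a non-sparse layer
# may still be permuted, which is the corner case mentioned above.
ASP.prune_trained_model(model, optimizer)

# ... continue with sparse fine-tuning as usual ...
```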
- 19 Jan, 2022 1 commit

Masaki Kozuki authored

- 13 Jan, 2022 1 commit

Shintaro Iwasaki authored

- 16 Dec, 2021 1 commit

Masaki Kozuki authored

- 15 Dec, 2021 1 commit

Masaki Kozuki authored
* Apply the formatter and remove a duplicate function definition.
* DRY the CUDA_HOME None check.
* `--threads 4`
- 14 Dec, 2021 1 commit

Masaki Kozuki authored
* Merge the .so files.
* ODR.
* Fix the build.
* Update imports.
* Apply psf/black with a max line length of 120.
* Update.
* Fix.
* Update.
* Build fixed again, but an undefined symbol remains.
* Fix 2; the layer norm gradient is still undefined.
* Remove unused .cpp files.
* Without layer_norm.cuh, the import works.
* `import fast_multihead_attn` works... but why? The unnecessary `#include "layer_norm.cuh"` was the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`.
* Clean up layer norm.
- 09 Dec, 2021 1 commit

Kevin Stephano authored
* Add the fused mixed-precision LAMB optimizer.
* Fix device usage in the constructor.
* Fix sending param_group tensor state to the device.
* Remove an unneeded device set.
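
For context, a fused LAMB optimizer is used as a drop-in replacement for a stock PyTorch optimizer. The sketch below uses `apex.optimizers.FusedLAMB`, which follows this pattern; the mixed-precision variant added in this commit is assumed to live under `apex.contrib` with a similar drop-in interface, and its exact module path is not shown here.

```python
import torch
from apex.optimizers import FusedLAMB

model = torch.nn.Linear(1024, 1024).cuda()
# Drop-in replacement for torch.optim-style optimizers; the update itself runs
# as a fused multi-tensor kernel on the GPU.
optimizer = FusedLAMB(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```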
- 27 Oct, 2021 1 commit

Masaki Kozuki authored
* Persistent LayerNorm: multi-CTA rewrite.
* autocast support.
Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>
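
A minimal sketch of exercising a fused LayerNorm under autocast, the combination this commit touches. The module path `apex.contrib.layer_norm.FastLayerNorm` is assumed to be the one backed by the persistent kernels; `apex.normalization.FusedLayerNorm` exposes the same `nn.LayerNorm`-like interface if the contrib build is unavailable.

```python
import torch
from apex.contrib.layer_norm import FastLayerNorm  # assumed module path for the persistent kernel

ln = FastLayerNorm(1024).cuda()                    # nn.LayerNorm-like: normalized over the last dim
x = torch.randn(8, 128, 1024, device="cuda")

with torch.cuda.amp.autocast():                    # this commit adds autocast support
    y = ln(x)

y.float().sum().backward()
```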
- 02 Oct, 2021 1 commit

Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
- 08 Sep, 2021 1 commit

Masaki Kozuki authored
- Pass include directories via `CUDAExtension`'s `include_dirs` argument.
- Remove `-I/path/to/dir` arguments from `extra_compile_args`.
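
A minimal `setup.py` sketch of the pattern this change moves to: include paths go through `CUDAExtension`'s `include_dirs` instead of being spliced into `extra_compile_args` as raw `-I` flags. The extension name and source files below are placeholders.

```python
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

csrc = os.path.join(os.path.dirname(os.path.abspath(__file__)), "csrc")

setup(
    name="example_ext",
    ext_modules=[
        CUDAExtension(
            name="example_ext",
            sources=[os.path.join(csrc, "example.cpp"),
                     os.path.join(csrc, "example_kernel.cu")],
            include_dirs=[csrc],            # instead of "-I" + csrc inside the flag lists
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": ["-O3", "--use_fast_math"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```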
- 01 Sep, 2021 2 commits

Burc Eryilmaz authored
* Fuse norm into scale.
* Add fused norm into DLAMB.
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Burc Eryilmaz authored
* Support for a fused dense layer with cublasLt, with fusion in both fprop and bprop.
* Fix a typo causing a syntax error.
* Add a fused GEMM+GELU+GEMM module.
* Fix a typo in the workspace size.
* Update the cuBLAS check for 11600.
* Add tests for the fused dense layer.
* Fix the CUDA 10.x path.
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
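
A rough usage sketch of the fused dense layers described above, treated as drop-in replacements for `nn.Linear` and a Linear+GELU+Linear stack. The module path `apex.fused_dense` and the class names below are assumptions inferred from the commit message, not a verified API.

```python
import torch
from apex.fused_dense import FusedDense, FusedDenseGeluDense  # assumed names

x = torch.randn(16, 1024, device="cuda", dtype=torch.half)

dense = FusedDense(1024, 4096).cuda().half()                  # fused GEMM + bias (cublasLt)
mlp = FusedDenseGeluDense(1024, 4096, 1024).cuda().half()     # fused GEMM + GELU + GEMM (assumed signature)

y = dense(x)
z = mlp(x)
z.float().sum().backward()   # the bprop-side fusion is exercised on the backward pass
```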
- 17 Jul, 2021 2 commits

Nan Zheng authored
* Added support for fused ReLU and dropout in the transducer joint.
* Reorganized the code selection path in the transducer joint fwd.
* Added support for fused ReLU+dropout in the transducer joint.
* Vectorize transducer loss backward with fused softmax (#3).
* Nanz/transducer loss (#4):
  * Vectorize transducer loss backward with fused softmax.
  * Added a predicate to avoid a potential IMA.
* Nanz/transducer loss (#5):
  * Vectorize transducer loss backward with fused softmax.
  * Added a predicate to avoid a potential IMA.
  * Added more predicates to avoid IMAs.
* Updated documentation for the newly added features.
* Fixed an error in transducer.py.
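
A construction-only sketch of the transducer extension these commits extend. The class names are assumed to come from `apex.contrib.transducer`; the forward-call conventions and the flags selecting the fused ReLU+dropout path are marked as assumptions in the comments rather than executed.

```python
from apex.contrib.transducer import TransducerJoint, TransducerLoss  # assumed class names

joint = TransducerJoint()   # the fused ReLU+dropout path added here is assumed to be an option of this module
loss_fn = TransducerLoss()  # backward is vectorized with a fused softmax per the commit messages

# Shapes follow the usual RNN-T convention (call signatures below are assumptions):
#   encoder output  f: (B, T, H), predictor output g: (B, U, H)
#   joint(f, g, f_len, g_len)             -> (B, T, U, H) combined activations
#   loss_fn(logits, labels, f_len, y_len) -> per-sequence RNN-T loss
```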
yjk21 authored

- 17 Apr, 2021 1 commit

Deyu Fu authored
* Initial commit adding the fast bottleneck block.
* Sync the cudnn-frontend submodule.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
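
A rough sketch of the fast bottleneck block: a cudnn-frontend-backed, fused replacement for a ResNet-style bottleneck (1x1, 3x3, 1x1 convolutions plus the residual). Only the module path `apex.contrib.bottleneck` is taken from apex; the constructor arguments below are an assumed parameterization, not a verified signature.

```python
import torch
from apex.contrib.bottleneck import Bottleneck  # assumed class name

block = Bottleneck(in_channels=256, bottleneck_channels=64,
                   out_channels=256, stride=1).cuda().half()

# The fused kernels target NHWC layouts, so feed channels_last tensors.
x = torch.randn(8, 256, 56, 56, device="cuda",
                dtype=torch.half).to(memory_format=torch.channels_last)
y = block(x)
```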
- 16 Apr, 2021 1 commit

yjk21 authored

- 24 Mar, 2021 1 commit

Nan Zheng authored
* Initial check-in of the transducer extension.
* Added more comments to help explain the code.
* Corrected minor typos.
* 1. Renamed a variable in the tests to match the extension. 2. Disabled the ninja build option.
- 23 Feb, 2021 1 commit

yjk21 authored

- 01 Dec, 2020 1 commit

Kexin Yu authored
DistributedFusedAdam model parallelism support (Megatron).
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
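
A minimal sketch of driving `DistributedFusedAdam`, which shards optimizer state across data-parallel ranks; the model-parallelism support added here lets it coexist with Megatron-style tensor parallelism. Everything beyond the module path, class name, and the `(params, lr)` arguments is an assumption.

```python
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

# Launched with one rank per GPU (e.g. via torchrun); NCCL backend assumed.
dist.init_process_group(backend="nccl")
model = torch.nn.Linear(1024, 1024).cuda()

optimizer = DistributedFusedAdam(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
```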
- 10 Aug, 2020 1 commit

ptrblck authored
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 01 Aug, 2020 1 commit

ptrblck authored

- 01 Jun, 2020 1 commit

mcarilli authored
Co-authored-by: Michael Carilli <mcarilli@nvidia.com>
- 30 May, 2020 2 commits

Thor Johnsen authored

Thor Johnsen authored

- 29 May, 2020 1 commit

Burc Eryilmaz authored
Fuse dropout and softmax in the backward pass, add bias support to the C++ MHA, add additive mask support, and separate the Q/K/V parameters (#854).
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
- 14 May, 2020 1 commit

Andrew Tulloch authored

- 23 Apr, 2020 1 commit

ptrblck authored
* Add a CUDAGenerator guard.
* Fix generator_flag.
* Add guards for the generator pointer/reference issue.
* Change mutex_ to mutex().
* Add check_generator.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 22 Apr, 2020 1 commit

Deyu Fu authored

- 23 Mar, 2020 1 commit

Kexin Yu authored

- 20 Mar, 2020 2 commits

- 11 Mar, 2020 1 commit

ptrblck authored
* Disable ninja for multihead_attn.
* Fix getCurrentStream in multihead_attn.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 02 Mar, 2020 1 commit

- 27 Feb, 2020 1 commit

mcarilli authored
* NHWC support for multi tensor apply.
* Compilation fix for version <= 1.4.
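
A minimal sketch of the `multi_tensor_applier` helper this commit extends: one fused kernel launch over a list of tensors instead of one launch per tensor, with the NHWC change meaning `channels_last` tensors may appear in those lists. The `amp_C.multi_tensor_scale` op used below is the stock apex one; the tensors themselves are placeholders.

```python
import torch
import amp_C
from apex.multi_tensor_apply import multi_tensor_applier

# Flag buffer the fused kernel sets if it encounters inf/nan.
overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")

srcs = [torch.randn(64, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last),
        torch.randn(1000, device="cuda")]
dsts = [torch.empty_like(t) for t in srcs]

# dst = src * 0.5 across the whole tensor list in a single fused launch.
multi_tensor_applier(amp_C.multi_tensor_scale, overflow_buf, [srcs, dsts], 0.5)
```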
- 25 Feb, 2020 2 commits

- 24 Feb, 2020 1 commit

Kevin Stephano authored
* Adding the C++ Multihead Attention implementation to contrib.
* Add a reference test that at least works for the forward pass.
* Remove CublasLt support from multihead attention.
* Add a new Python version of self attention.
* Update the Python model of MHA with a backward pass.
* Fixed the Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add the Encdec Python version of multihead attention. Clean up files.
* Tests for self and encdec multihead attention.
* Add a reference PyTorch implementation of attention with norm and add.
* Add the cutlass branch definition.
* Add the cutlass download to the compile.
* Add norm/add tests.
* Add biases to the PyTorch Python versions.
* Add tests and fix issues with the Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix the matmul1 output tensor size. Fix tests that missed the issue.
* Allow Z dimensions of 64K and greater on batched GEMMs.
* Remove redundant imports.
* General cleanup; remove deprecated or unused functions.
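
A rough usage sketch of the contrib multihead attention modules this series adds. `SelfMultiheadAttn` is assumed to mirror `torch.nn.MultiheadAttention`'s `(seq, batch, embed)` layout and call convention; constructor options beyond `embed_dim`, `num_heads`, and `dropout` are not shown because their exact names would be assumptions.

```python
import torch
from apex.contrib.multihead_attn import SelfMultiheadAttn  # assumed class name

attn = SelfMultiheadAttn(embed_dim=1024, num_heads=16, dropout=0.1).cuda().half()

q = torch.randn(128, 8, 1024, device="cuda", dtype=torch.half)  # (seq, batch, embed)
out, _ = attn(q, q, q)   # self attention; call/return convention assumed to follow nn.MultiheadAttention
out.float().sum().backward()
```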
- 15 Feb, 2020 1 commit

Deyu Fu authored

- 06 Feb, 2020 1 commit

Kevin Stephano authored
* Adding the C++ Multihead Attention implementation to contrib.
* Add a reference test that at least works for the forward pass.
* Remove CublasLt support from multihead attention.
* Add a new Python version of self attention.
* Update the Python model of MHA with a backward pass.
* Fixed the Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add the Encdec Python version of multihead attention. Clean up files.
* Tests for self and encdec multihead attention.
* Add a reference PyTorch implementation of attention with norm and add.
* Add the cutlass branch definition.
* Add the cutlass download to the compile.
* Add norm/add tests.
* Add biases to the PyTorch Python versions.
* Add tests and fix issues with the Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add f...