1. 13 Apr, 2022 1 commit
    • Cherry-picked the commit from upstream for faster --fast_multihead_attn build (#76) · 29b36315
      Hubert Lu authored
      
      
      * Faster `--fast_multihead_attn` build (#1245)
      
      * merge .so files
      
      * odr
      
      * fix build
      
      * update import
      
      * apply psf/black with max line length of 120
      
      * update
      
      * fix
      
      * update
      
      * build fixed again but undefined symbol again
      
      * fix 2, still layer norm grad is undefined
      
      * remove unused cpp files
      
      * without layer_norm.cuh, import works
      
      * import fast_multihead_attn works...
      
      but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit
      that kept the shared objects from linking `HostApplyLayerNorm` and
      `HostLayerNormGradient`?
      
      * clean up layer norm
      
      * Fix some bugs
      Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
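      
      The speedup above comes from merging the per-kernel shared objects into a
      single extension, so common helpers such as the layer norm functions are
      compiled and linked exactly once.  A minimal setup.py sketch of that
      pattern (the source file names here are illustrative, not the actual apex
      file list):
      
      ```python
      from setuptools import setup
      from torch.utils.cpp_extension import BuildExtension, CUDAExtension
      
      # One extension with all kernel sources instead of one .so per kernel,
      # so shared helpers (e.g. the layer norm functions) are defined exactly
      # once in the final shared object.
      setup(
          name="fast_multihead_attn",
          ext_modules=[
              CUDAExtension(
                  name="fast_multihead_attn",
                  sources=[
                      "fast_multihead_attn.cpp",       # single pybind11 entry point
                      "self_multihead_attn_cuda.cu",   # illustrative kernel sources
                      "encdec_multihead_attn_cuda.cu",
                  ],
                  extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
              )
          ],
          cmdclass={"build_ext": BuildExtension},
      )
      ```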
  2. 06 Apr, 2022 1 commit
    • Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with... · 5ecad142
      Hubert Lu authored
      Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
      
      * First attempt to make rocblas flag backward compatible
      
      * Fix some bugs
      
      * Fix some bugs
      
      * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
      
      * Add groupbn extension unit tests for ROCm
      
      * Fix some bugs
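      
      A common way to keep such a flag backward compatible is to detect the
      PyTorch version in setup.py and gate the new code path behind a
      compile-time macro.  A minimal sketch of that pattern, assuming a
      hypothetical ROCBLAS_ALT_IMPL_AVAILABLE macro and an illustrative version
      cutoff (neither is taken from the actual commit):
      
      ```python
      import torch
      
      version_dependent_macros = []
      
      # Parse the installed PyTorch version; suffixes such as "a0+git..." on
      # the patch component are ignored because only major.minor is compared.
      major, minor = (int(x) for x in torch.__version__.split(".")[:2])
      
      # Older PyTorch ROCm builds do not expose rocblas_gemm_flags_fp16_alt_impl,
      # so only define the macro when the version is new enough.
      if (major, minor) >= (1, 10):
          version_dependent_macros.append("-DROCBLAS_ALT_IMPL_AVAILABLE")
      
      # version_dependent_macros is then appended to extra_compile_args of the
      # MHA/MLP extensions, and the C++ sources wrap the flag usage in
      # `#ifdef ROCBLAS_ALT_IMPL_AVAILABLE ... #else ... #endif`.
      ```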
  3. 11 Mar, 2022 1 commit
  4. 28 Jan, 2022 1 commit
  5. 09 Dec, 2021 1 commit
  6. 03 Dec, 2021 1 commit
  7. 02 Dec, 2021 1 commit
  8. 02 Nov, 2021 1 commit
  9. 27 Oct, 2021 1 commit
  10. 21 Oct, 2021 1 commit
  11. 19 Oct, 2021 2 commits
  12. 02 Oct, 2021 1 commit
  13. 08 Sep, 2021 1 commit
    • enable ninja (#1164) · 9ce0a10f
      Masaki Kozuki authored
      - passing include directories to `CUDAExtension`'s `include_dirs` argument
      - removing `-I/path/to/dir` arguments from `extra_compile_args`
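      
      The change amounts to declaring include paths through `CUDAExtension`'s
      `include_dirs` argument rather than hand-written `-I` flags in
      `extra_compile_args`.  A minimal sketch (module and path names are
      illustrative):
      
      ```python
      import os
      from torch.utils.cpp_extension import CUDAExtension
      
      this_dir = os.path.dirname(os.path.abspath(__file__))
      
      ext = CUDAExtension(
          name="fused_example",                            # illustrative name
          sources=["csrc/fused_example.cpp", "csrc/fused_example_kernel.cu"],
          include_dirs=[os.path.join(this_dir, "csrc", "include")],
          extra_compile_args={
              "cxx": ["-O3"],
              # previously something like:
              # "nvcc": ["-O3", "-I" + os.path.join(this_dir, "csrc/include")]
              "nvcc": ["-O3"],
          },
      )
      ```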
  14. 07 Sep, 2021 1 commit
    • Enable group batch norm (--bnp) on ROCm (only bn_group = 1) (#51) · e57c84e0
      sarunyap authored
      * Enable group batch norm (--bnp) on ROCm (only bn_group = 1)
      
      Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1).
      The multi-GPU case (bn_group > 1) will be revisited in the future.
      
      The following are the main changes:
      
      1) Use MIOpen data structures/functions in HIP instead of CUDNN
      2) For the warp-level primitive code, ensure that the code operates on
         a 64-thread-wide warp instead of a 32-thread-wide one
      3) Disable all the bn_group > 1 paths
      
      Notes:
      
      1) Multi-stream is not tested.
      2) We have not optimized for performance.
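      
      For orientation, the fused NHWC path enabled here computes what the
      following plain PyTorch reference does for the batch norm + add + ReLU
      case with bn_group = 1; this is only a semantic sketch, not the contrib
      API:
      
      ```python
      import torch
      import torch.nn.functional as F
      
      def ref_bn_add_relu(x, z, weight, bias, running_mean, running_var, eps=1e-5):
          """Reference for fused NHWC batch norm + residual add + ReLU (bn_group = 1).
      
          x, z: FP16 tensors of shape [N, H, W, C]; weight, bias, running_mean,
          running_var: FP32 tensors of shape [C].  All math is done in FP32.
          """
          y = F.batch_norm(
              x.float().permute(0, 3, 1, 2),      # NHWC -> NCHW for the reference op
              running_mean, running_var, weight, bias,
              training=True, eps=eps,
          ).permute(0, 2, 3, 1)                   # back to NHWC
          y = y + z.float()                       # residual add
          return F.relu(y).half()                 # ReLU, FP16 output
      ```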
      
      * Fix bnp hipification
      
      Avoid calling hipify-perl in setup.py and rely on PyTorch's internal
      hipification mechanism.
      
      * Make bnp data pointers contiguous
      
      The contrib group batch norm implementation assumes that all input
      tensors are contiguous.  When non-contiguous tensors are passed to the
      function, it gives a wrong result.  This commit explicitly calls
      .contiguous() to make all input tensors contiguous before accessing
      them.
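      
      A minimal sketch of the wrapper-side fix (the function and argument
      names are illustrative, not the actual bnp interface):
      
      ```python
      def bnp_forward(x, scale, bias, ext_forward):
          # The kernels index raw data pointers assuming densely packed
          # storage, so force contiguity before calling into the extension.
          x, scale, bias = x.contiguous(), scale.contiguous(), bias.contiguous()
          return ext_forward(x, scale, bias)
      ```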
      
      * Fix HIP lane id in bnp
      
      Fix typo
      
      * Fix ReLU bitmask for HIP in bnp
      
      The ReLU bitmask is derived by using the __ballot function which returns
      a 64-bit value in HIP.  This commit fixes the ReLU bitmask storage size
      and offsets on ROCm.
      
      This patch also fixes the kernel to set ReLU bitmask to 1 when the data
      is less than or equal to zero (not only less than).  Not doing so can
      cause a stability issue.
      
      * Remove multiple of 64 offset for HIP in bnp
      
      The multiple of 64 offset is not necessary.
      
      * Use FP16 intermediate output to determine whether to rectify in bnp
      
      Group batch norm takes FP16 tensors and produces FP16 output; however,
      all arithmetic operations are done in FP32, so the intermediate outputs
      are in FP32.  In the fusion kernels, ReLU inspects the FP32 intermediate
      output to decide whether to rectify it, and it must rectify the output
      if it is less than or "equal" to zero.  There is a chance that the
      intermediate FP32 output is very close to zero and becomes exactly zero
      when converted to FP16.  In this case, the output is not rectified when
      it should be.  Since the output is not rectified in the forward pass,
      the gradient is not rectified in the backward pass.  This can cause a
      stability issue.
      
      This patch can have a negative impact on the performance of group batch
      norm as we perform FP32-FP16 conversion multiple times.
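      
      The corner case is easy to reproduce in plain PyTorch: a value can be
      positive in FP32 yet flush to zero in FP16, so a rectify decision based
      on the FP32 value disagrees with the FP16 output that the backward pass
      sees.
      
      ```python
      import torch
      
      x = torch.tensor([1e-9], dtype=torch.float32)  # tiny positive FP32 intermediate
      print(x > 0)          # tensor([True])  -> FP32-based ReLU would not rectify
      print(x.half())       # tensor([0.], dtype=torch.float16) -> stored output is zero
      print(x.half() <= 0)  # tensor([True]) -> FP16-based decision rectifies,
                            #                   matching what the backward pass sees
      ```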
      
      * Disable dispatchX ParallelSums in HIP in bnp
      
      dispatchX is not required for the bn_group = 1 case.
      
      * Use traditional load/store for HIP in bnp
      
      The built-in function has a high floating point rounding error.  Thus,
      we replace it with the traditional load/store.  Doing so breaks the
      aligned pointer property in the load/store functions.  We conservatively
      use traditional load/store for all memory access.
      
      * Replace shfl_down with shfl_sync in parallel sums for HIP in bnp
      
      This commit separates the HIP code from the CUDA code in parallel sums.
      
      * Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp
      
      Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is
      no longer needed.
      
      * Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp
      
      * Add test for bnp
      
      The test evaluates correctness of batch norm, batch norm + ReLU, and
      batch norm + add + ReLU against the reference implementation.
      
      For the forward activation output, we validate it against PyTorch's
      implementation.  The group batch norm activation output must be allclose
      with the PyTorch activation output for the test to pass.
      
      For the backward gradient output, we validate it against the Python
      implementation.  Due to the floating point rounding error in the batch
      norm implementation, the group batch norm gradient output might not be
      allclose with the Python implementation's output when ReLU is used,
      although the majority of the elements are very close to each other.
      Thus, we use a norm-difference threshold instead of allclose to
      determine whether the test passes or fails.
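      
      A sketch of such a check, with an illustrative threshold rather than the
      exact value used by the test:
      
      ```python
      import torch
      
      def grads_close(out_grad, ref_grad, threshold=1e-3):
          # Relative L2 norm difference instead of elementwise torch.allclose:
          # a few elements perturbed by rounding around the ReLU boundary are
          # tolerated, while systematic errors still fail the check.
          diff = (out_grad.float() - ref_grad.float()).norm()
          return (diff / ref_grad.float().norm().clamp_min(1e-12)).item() < threshold
      ```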
      
      * Use the warp size variable rather than hard coding the warp size in bnp
      
      Use C10_WARP_SIZE from c10/macros/Macros.h in the host functions and use
      warpSize in the device kernels instead of hard coding the warp size.
  15. 01 Sep, 2021 3 commits
  16. 31 Aug, 2021 1 commit
  17. 17 Jul, 2021 2 commits
    • Added more fusion and vectorized kernel for transducer (#1125) · 0c2c6eea
      Nan Zheng authored
      * Added support for fused ReLU and dropout into transducer joint
      
      * Reorganized code selection path in transducer joint fwd
      * Added support for fused ReLU+dropout into transducer joint
      
      * Vectorize transducer loss backward with fused softmax (#3)
      
      * Nanz/transducer loss (#4)
      
      * Vectorize transducer loss backward with fused softmax
      
      * Added a predicate to avoid potential IMA
      
      * Nanz/transducer loss (#5)
      
      * Vectorize transducer loss backward with fused softmax
      
      * Added a predicate to avoid potential IMA
      
      * Added more predicates to avoid IMAs
      
      * Updated documentation for newly added features.
      
      * Fixed an error in transducer.py
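      
      For context, the fused ReLU + dropout joint computes in one kernel what
      this unfused reference does; the tensor names and shapes follow the
      usual RNN-T convention and are illustrative:
      
      ```python
      import torch
      import torch.nn.functional as F
      
      def ref_transducer_joint(f, g, dropout_prob=0.1, training=True):
          """Unfused reference for the transducer joint with ReLU + dropout.
      
          f: encoder output   [B, T, H]
          g: predictor output [B, U, H]
          returns: joint output [B, T, U, H]
          """
          h = f.unsqueeze(2) + g.unsqueeze(1)  # broadcast add over (T, U)
          h = F.relu(h)                        # ReLU fused into the custom kernel
          return F.dropout(h, p=dropout_prob, training=training)  # fused dropout
      ```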
    • Adds small-batch kernels (#1126) · ed719967
      yjk21 authored
  18. 25 Jun, 2021 1 commit
  19. 17 Apr, 2021 1 commit
  20. 16 Apr, 2021 1 commit
  21. 24 Mar, 2021 1 commit
    • Initial check-in of the transducer extensions (#1069) · d86d1b09
      Nan Zheng authored
      * Initial check-in of the transducer extension.
      
      * Added more comments to help explain the code
      
      * Corrected minor typos
      
      * 1. Renamed variable in tests to match the extension
      2. Disabled ninja build option
  22. 23 Feb, 2021 1 commit
  23. 21 Jan, 2021 1 commit
  24. 18 Jan, 2021 1 commit
  25. 16 Dec, 2020 1 commit
  26. 15 Dec, 2020 3 commits
  27. 10 Dec, 2020 1 commit
  28. 09 Dec, 2020 2 commits
  29. 01 Dec, 2020 1 commit
  30. 18 Aug, 2020 1 commit
  31. 17 Aug, 2020 1 commit
  32. 10 Aug, 2020 1 commit
  33. 05 Aug, 2020 1 commit