- 23 Aug, 2022 1 commit
hanbao authored
Co-authored-by: Han Bao <hbao@nvidia.com>

- 07 Jul, 2022 1 commit
Masaki Kozuki authored
* remove pyprof
* remove reparameterization
* remove pyprof test
* clean up

- 31 May, 2022 1 commit
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compat for new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent lengthy if-statement

- 21 Apr, 2022 1 commit
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
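The guards here are about skipping tests when the toolchain is too old. A minimal sketch of that pattern, assuming a plain unittest setup (the helper name and the 11.0 cutoff are illustrative, not Apex's actual values):

```python
# Minimal sketch (not Apex's actual guard): gate a test on the CUDA version
# PyTorch was built against, instead of failing at import or runtime.
import unittest

import torch


def cuda_version_at_least(major: int, minor: int) -> bool:
    """True if PyTorch was built against CUDA >= major.minor and a GPU is present."""
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    built_major, built_minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return (built_major, built_minor) >= (major, minor)


class FusedKernelTest(unittest.TestCase):
    @unittest.skipUnless(cuda_version_at_least(11, 0), "requires CUDA >= 11.0")
    def test_extension_loads(self):
        self.assertTrue(torch.cuda.is_available())
```
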
- 19 Apr, 2022 1 commit
Masaki Kozuki authored
* bump version
* add guard
* fix the cond

- 15 Apr, 2022 1 commit
Hubert Lu authored
* Add setup_simple.py for debugging the compile issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve filename collision of *.cpp files with to-hipify code and *.cu files

- 13 Apr, 2022 1 commit
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* clean up layer norm
* Fix some bugs
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

- 06 Apr, 2022 1 commit
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs

- 05 Apr, 2022 2 commits
Thor Johnsen authored
Thor Johnsen authored

- 30 Mar, 2022 1 commit
Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion. The following modules are enabled using cuDNN runtime fusion: 1) Conv-Bias-ReLU (+backward), 2) Conv-Bias (+backward), 3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest - remove redundant dtype casts; simulate the usage in the unittest by using torch.cuda.amp.autocast
* Fixed save_for_backward
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: root <root@luna-0277.selene.nvidia.com>
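As a rough illustration of the unittest pattern described above (comparing a fused op against plain PyTorch under autocast), here is a hedged sketch; `fused_conv_bias_relu` is a hypothetical handle for the cuDNN runtime-fusion op, not Apex's actual function name:

```python
# Hedged sketch of the test pattern: validate a fused Conv-Bias-ReLU op
# against a plain PyTorch reference, both run under torch.cuda.amp.autocast.
import torch
import torch.nn.functional as F


def reference_conv_bias_relu(x, weight, bias, stride=1, padding=0):
    # Plain PyTorch reference: Conv2d + bias + ReLU.
    return F.relu(F.conv2d(x, weight, bias, stride=stride, padding=padding))


def check_against_reference(fused_conv_bias_relu):
    torch.manual_seed(0)
    x = torch.randn(8, 64, 32, 32, device="cuda")
    w = torch.randn(128, 64, 3, 3, device="cuda")
    b = torch.randn(128, device="cuda")
    # Simulate mixed-precision usage the way the unittest does.
    with torch.cuda.amp.autocast():
        ref = reference_conv_bias_relu(x, w, b, padding=1)
        out = fused_conv_bias_relu(x, w, b, padding=1)
    torch.testing.assert_close(out.float(), ref.float(), rtol=2e-3, atol=2e-3)
```
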
- 25 Mar, 2022 1 commit
Thor Johnsen authored

- 24 Mar, 2022 1 commit
Masaki Kozuki authored
Take-over of #1097
* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* TORCH_CUDA_CHECK -> AT_CUDA_CHECK. The former is defined in torch/csrc/profiler/cuda.cpp so it's usually not available; the latter is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK.
* add test
* clean up
* guard for torchvision
Co-authored-by: Wil Kong <alpha0422@gmail.com>
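The "guard for torchvision" bullet presumably means the focal loss tests are skipped when torchvision (which provides a reference focal loss) is not installed. A minimal sketch of that guard, under that assumption:

```python
# Minimal sketch: skip the focal loss tests when torchvision is unavailable.
import unittest

try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False


@unittest.skipUnless(HAS_TORCHVISION, "torchvision is required for the reference focal loss")
class TestFocalLoss(unittest.TestCase):
    def test_placeholder(self):
        # Real tests would compare the fused CUDA focal loss against a
        # torchvision-based reference (e.g. torchvision.ops.sigmoid_focal_loss).
        self.assertTrue(HAS_TORCHVISION)
```
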
- 23 Mar, 2022 1 commit
Thor Johnsen authored

- 11 Mar, 2022 1 commit
Pruthvi Madugundu authored

- 27 Feb, 2022 1 commit
Masaki Kozuki authored

- 26 Feb, 2022 1 commit
Masaki Kozuki authored
* fuse grad accumulation w/ weight grad
* fp32 training path
* not using *args, **kwargs
* backward: moved the tensor dimension conversion
* move files to csrc/megatron
* fix fp32 path
* fix typo
* add to in order to select the correct custom extension
* fix typo
* comment on import guard
* update test: enable gradient_accumulation_fusion
* 86
* remove redundant call of `test_column_parallel_linear`
Co-authored-by: Sangkug Lym <slym@nvidia.com>
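For context on the first bullet, "fuse grad accumulation w/ weight grad" refers to writing the weight gradient directly into a persistent fp32 accumulation buffer during backward instead of materializing a separate grad tensor and adding it afterwards. A pure-PyTorch sketch of the unfused equivalent (the `main_grad` attribute name follows Megatron-LM convention; the fused path does this inside a CUDA kernel):

```python
# Illustrative sketch (not the fused CUDA kernel): accumulate dL/dW of a
# linear layer straight into weight.main_grad during the backward pass.
import torch


def linear_backward_with_fused_accumulation(grad_output, input, weight):
    # grad_output: (tokens, out_features), input: (tokens, in_features)
    with torch.no_grad():
        grad_input = grad_output @ weight               # dL/dx for the previous layer
        weight.main_grad.add_(grad_output.t() @ input)  # dL/dW accumulated in place
    return grad_input


weight = torch.nn.Parameter(torch.randn(4, 3))
weight.main_grad = torch.zeros(4, 3, dtype=torch.float32)  # persistent fp32 buffer
x = torch.randn(5, 3)
g = torch.randn(5, 4)
_ = linear_backward_with_fused_accumulation(g, x, weight)
```
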
- 10 Feb, 2022 1 commit
Masaki Kozuki authored

- 01 Feb, 2022 1 commit
ChongyuNVIDIA authored
* Add the permutation related support as the extension for the ASP lib.
* [Fix] Track the permutation sequence for the progressive channel swap strategy.
* Fix the corner case where one layer is not sparse but needs to apply permutation due to its siblings.
* Fix the deprecated functions in ASP unit tests.
* Fix the sparsity info typo in the ASP lib.
* [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
* Update the README.md with the identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into the ASP lib.
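A small sketch of the "identical random seed for all GPUs" point (the helper name is illustrative; ASP's own entry point may differ): every rank seeds its RNGs with the same constant so the permutation search explores the same candidates and produces identical results everywhere.

```python
# Sketch: seed all RNG sources identically on every rank/GPU, on purpose
# (unlike the usual per-rank seeding), so permutation search is reproducible
# and consistent across devices.
import random

import numpy as np
import torch


def set_identical_seed(seed: int = 1) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_identical_seed(1)
```
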
- 28 Jan, 2022 1 commit
Jithun Nair authored

- 19 Jan, 2022 1 commit
Masaki Kozuki authored

- 13 Jan, 2022 1 commit
Shintaro Iwasaki authored

- 16 Dec, 2021 1 commit
Masaki Kozuki authored

- 15 Dec, 2021 1 commit
Masaki Kozuki authored
* apply formatter & remove duplicate func def
* dry CUDA_HOME None check
* `--threads 4`
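"Dry CUDA_HOME None check" refers to deduplicating the repeated build-time check in setup.py; a sketch under that assumption (the helper name is made up), with `--threads 4` presumably being nvcc's parallel-compilation flag:

```python
# Sketch: factor the repeated "is a CUDA toolkit available at build time?"
# check into one helper instead of repeating it per extension flag.
from torch.utils.cpp_extension import CUDA_HOME


def require_cuda_home(ext_name: str) -> None:
    if CUDA_HOME is None:
        raise RuntimeError(
            f"Cannot build {ext_name}: CUDA_HOME is not set. "
            "Install a CUDA toolkit or drop the extension flag."
        )


# Example use while assembling ext_modules in setup.py:
# require_cuda_home("--fast_multihead_attn")
```
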
- 14 Dec, 2021 1 commit
Masaki Kozuki authored
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* clean up layer norm

- 09 Dec, 2021 2 commits
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
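A hedged sketch of the "sending param_group tensor state to device" fix: when optimizer hyperparameters are kept as tensors (so a fused kernel can read them without host syncs), they must live on the same device as the parameters they drive. The key names below are illustrative, not the optimizer's actual state layout.

```python
# Sketch: move any tensor-valued entries in each param_group onto the
# device of that group's parameters.
import torch


def move_group_state_to_device(param_groups):
    for group in param_groups:
        device = group["params"][0].device
        for key in ("step", "lr", "beta1", "beta2"):  # illustrative keys only
            value = group.get(key)
            if torch.is_tensor(value) and value.device != device:
                group[key] = value.to(device)
```
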
- 03 Dec, 2021 1 commit
hubertlu-tw authored

- 02 Dec, 2021 1 commit
Jithun Nair authored
* Use --cuda_ext flag to build all supported extensions
* Don't remove --cuda_ext since it'll be needed to build other extensions
* Need to clear all cmdline args so setup.py doesn't complain
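A sketch of the last two bullets, assuming the usual Apex-style setup.py pattern (flag handling simplified): custom options such as `--cuda_ext` are read first and then stripped from `sys.argv` so setuptools only sees arguments it understands.

```python
# Sketch of a setup.py that consumes a custom --cuda_ext flag before
# handing the remaining command line to setuptools.
import sys

from setuptools import setup

BUILD_ALL_CUDA_EXTS = "--cuda_ext" in sys.argv

ext_modules = []
if BUILD_ALL_CUDA_EXTS:
    # ... append every supported CUDAExtension here ...
    pass

# Strip the custom flag so setup() does not reject an unknown option.
sys.argv = [arg for arg in sys.argv if arg != "--cuda_ext"]

setup(
    name="example_package",
    version="0.1",
    ext_modules=ext_modules,
)
```
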
- 02 Nov, 2021 1 commit
Hubert Lu authored
Co-authored-by: Jeff Daily <jeff.daily@amd.com>

- 27 Oct, 2021 1 commit
Masaki Kozuki authored
* Persistent LayerNorm: Multi-CTA Rewrite
* autocast support
Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>

- 21 Oct, 2021 1 commit
Jeff Daily authored

- 19 Oct, 2021 2 commits
- 02 Oct, 2021 1 commit
Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

- 08 Sep, 2021 1 commit
Masaki Kozuki authored
- passing include directories to `CUDAExtension`'s `include_dirs` argument
- removing `-I/path/to/dir` arguments from `extra_compile_args`
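A sketch of the change described above (paths and the extension name are illustrative): include paths go into `CUDAExtension`'s `include_dirs` rather than as raw `-I` strings inside `extra_compile_args`.

```python
# Sketch: declare include directories via include_dirs instead of -I flags.
import os

from torch.utils.cpp_extension import CUDAExtension

this_dir = os.path.dirname(os.path.abspath(__file__))

ext = CUDAExtension(
    name="example_cuda_ext",
    sources=["csrc/example.cpp", "csrc/example_kernel.cu"],
    include_dirs=[os.path.join(this_dir, "csrc", "includes")],  # instead of "-I..."
    extra_compile_args={
        "cxx": ["-O3"],
        "nvcc": ["-O3", "--use_fast_math"],  # no -I arguments needed here any more
    },
)
```
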
- 07 Sep, 2021 1 commit
sarunyap authored
* Enable group batch norm (--bnp) on ROCm (only bn_group = 1). Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1). The multi-GPU case (bn_group > 1) will be revisited in the future. The main changes: 1) use MIOpen data structures/functions in HIP instead of CUDNN; 2) for the warp-level primitive code, ensure that the code operates on a 64-thread-wide warp instead of a 32-thread-wide one; 3) disable all the bn_group > 1 paths. Notes: 1) multi-stream is not tested; 2) we have not optimized for performance.
* Fix bnp hipification. Avoid calling hipify-perl in setup.py and rely on PyTorch's internal hipification mechanism.
* Make bnp data pointers contiguous. The contrib group batch norm implementation assumes that all input tensors are contiguous; when non-contiguous tensors are passed to the function, it gives a wrong result. This commit explicitly calls .contiguous() to make all input tensors contiguous before accessing them.
* Fix HIP lane id in bnp. Fix typo.
* Fix ReLU bitmask for HIP in bnp. The ReLU bitmask is derived by using the __ballot function, which returns a 64-bit value in HIP. This commit fixes the ReLU bitmask storage size and offsets on ROCm. This patch also fixes the kernel to set the ReLU bitmask to 1 when the data is less than or equal to zero (not only less than). Not doing so can cause a stability issue.
* Remove multiple-of-64 offset for HIP in bnp. The multiple-of-64 offset is not necessary.
* Use FP16 intermediate output to determine whether to rectify in bnp. Group batch norm takes FP16 tensors and produces FP16 output; however, all arithmetic operations are done in FP32, so intermediate outputs are in FP32. For the fusion kernels, ReLU checks the FP32 intermediate output to decide whether to rectify it. ReLU must rectify the intermediate output if it is less than or "equal" to zero. There is a chance that the intermediate FP32 output is very close to zero, and when it is converted to FP16 it becomes zero; in this case the output is not rectified when it should be. Since the output is not rectified in the forward pass, the gradient is not rectified in the backward pass. This can cause a stability issue. This patch can have a negative impact on the performance of group batch norm as we perform FP32-FP16 conversion multiple times.
* Disable dispatchX ParallelSums in HIP in bnp. dispatchX is not required for the bn_group = 1 case.
* Use traditional load/store for HIP in bnp. The built-in function has a high floating point rounding error, so we replace it with the traditional load/store. Doing so breaks the aligned-pointer property in the load/store functions, so we conservatively use traditional load/store for all memory access.
* Replace shfl_down with shfl_sync in parallel sums for HIP in bnp. This commit separates the HIP code from the CUDA code in parallel sums.
* Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp. Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is no longer needed.
* Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp.
* Add test for bnp. The test evaluates the correctness of batch norm, batch norm + ReLU, and batch norm + add + ReLU against reference implementations. For the forward activation output, we validate against PyTorch's implementation: the group batch norm activation output must be allclose with the PyTorch activation output for the test to pass. For the backward gradient output, we validate against a Python implementation. Due to the floating point rounding error in the batch norm implementation, the group batch norm gradient output might not be allclose with the Python implementation output when ReLU is being used, although the majority of the elements are very close to each other. Thus, we use a norm-difference threshold to determine whether the test passes or fails instead of allclose.
* Use the warp size variable rather than hard coding the warp size in bnp. Use C10_WARP_SIZE from c10/macros/Macros.h in the host functions and use warpSize in the device kernels instead of hard coding the warp size.
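A small sketch of the norm-difference criterion used by the bnp gradient test (the threshold value is illustrative): compare the relative L2 norm of the error rather than requiring elementwise allclose.

```python
# Sketch: a relative-norm check tolerates a few slightly-off elements caused
# by floating point rounding, unlike an elementwise allclose check.
import torch


def relative_norm_error(result: torch.Tensor, reference: torch.Tensor) -> float:
    ref = reference.float()
    return (result.float() - ref).norm().item() / ref.norm().item()


def check_gradient(result, reference, threshold=1e-3):
    err = relative_norm_error(result, reference)
    assert err < threshold, f"relative norm error {err:.3e} exceeds {threshold:.1e}"
```
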
- 01 Sep, 2021 3 commits
Jeff Daily authored
Burc Eryilmaz authored
* fuse norm into scale
* add fused norm into dlamb
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop
* fix typo causing syntax error
* add fused GEMM+gelu+GEMM module
* fix typo for workspace size
* update cublas check for 11600
* add tests for fused dense layer
* fix CUDA 10.x path
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
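A hedged usage sketch for the fused dense layers this commit adds; the import path, class names (`apex.fused_dense.FusedDense`, `FusedDenseGeluDense`), and constructor signatures are assumptions and may not match the actual API.

```python
# Hedged sketch: a fused GEMM+bias layer and a fused GEMM+GeLU+GEMM block,
# both running in FP16 on the GPU so the cublasLt epilogues can be used.
import torch

from apex.fused_dense import FusedDense, FusedDenseGeluDense  # assumed import path

batch, hidden, intermediate = 32, 1024, 4096

# Single fused GEMM + bias (fprop and bprop both fused).
dense = FusedDense(hidden, hidden).half().cuda()

# Fused GEMM + GeLU + GEMM, e.g. a transformer MLP block in one module.
mlp = FusedDenseGeluDense(hidden, intermediate, hidden).half().cuda()

x = torch.randn(batch, hidden, dtype=torch.half, device="cuda", requires_grad=True)
y = mlp(dense(x))
y.sum().backward()
```
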