- 23 Aug, 2022 1 commit
hanbao authored
Co-authored-by: Han Bao <hbao@nvidia.com>

- 07 Jul, 2022 1 commit
Masaki Kozuki authored
* remove pyprof
* remove reparameterization
* remove pyprof test
* clean up

- 31 May, 2022 1 commit
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compat for new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent lengthy if-statement

- 21 Apr, 2022 1 commit
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
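The guards here are about skipping tests when the toolchain is too old. A minimal sketch of that pattern, assuming a plain unittest setup (the helper name and the 11.0 cutoff are illustrative, not Apex's actual values):

```python
# Minimal sketch (not Apex's actual guard): gate a test on the CUDA version
# PyTorch was built against, instead of failing at import or runtime.
import unittest

import torch


def cuda_version_at_least(major: int, minor: int) -> bool:
    """True if PyTorch was built against CUDA >= major.minor and a GPU is present."""
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    built_major, built_minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return (built_major, built_minor) >= (major, minor)


class FusedKernelTest(unittest.TestCase):
    @unittest.skipUnless(cuda_version_at_least(11, 0), "requires CUDA >= 11.0")
    def test_extension_loads(self):
        self.assertTrue(torch.cuda.is_available())
```
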
- 19 Apr, 2022 1 commit
Masaki Kozuki authored
* bump version
* add guard
* fix the cond

- 15 Apr, 2022 1 commit
Hubert Lu authored
* Add setup_simple.py for debugging the compile issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve filename collision of *.cpp files with to-hipify code and *.cu files

- 13 Apr, 2022 1 commit
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* clean up layer norm
* Fix some bugs
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

- 06 Apr, 2022 1 commit
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs

- 05 Apr, 2022 2 commits
Thor Johnsen authored
Thor Johnsen authored

- 30 Mar, 2022 1 commit
Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion. The following modules are enabled using cuDNN runtime fusion: 1) Conv-Bias-ReLU (+backward), 2) Conv-Bias (+backward), 3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest - remove redundant dtype casts; simulate the usage in the unittest by using torch.cuda.amp.autocast
* Fixed save_for_backward
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: root <root@luna-0277.selene.nvidia.com>
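As a rough illustration of the unittest pattern described above (comparing a fused op against plain PyTorch under autocast), here is a hedged sketch; `fused_conv_bias_relu` is a hypothetical handle for the cuDNN runtime-fusion op, not Apex's actual function name:

```python
# Hedged sketch of the test pattern: validate a fused Conv-Bias-ReLU op
# against a plain PyTorch reference, both run under torch.cuda.amp.autocast.
import torch
import torch.nn.functional as F


def reference_conv_bias_relu(x, weight, bias, stride=1, padding=0):
    # Plain PyTorch reference: Conv2d + bias + ReLU.
    return F.relu(F.conv2d(x, weight, bias, stride=stride, padding=padding))


def check_against_reference(fused_conv_bias_relu):
    torch.manual_seed(0)
    x = torch.randn(8, 64, 32, 32, device="cuda")
    w = torch.randn(128, 64, 3, 3, device="cuda")
    b = torch.randn(128, device="cuda")
    # Simulate mixed-precision usage the way the unittest does.
    with torch.cuda.amp.autocast():
        ref = reference_conv_bias_relu(x, w, b, padding=1)
        out = fused_conv_bias_relu(x, w, b, padding=1)
    torch.testing.assert_close(out.float(), ref.float(), rtol=2e-3, atol=2e-3)
```
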
- 25 Mar, 2022 1 commit
Thor Johnsen authored

- 24 Mar, 2022 1 commit
Masaki Kozuki authored
Take-over of #1097
* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* TORCH_CUDA_CHECK -> AT_CUDA_CHECK. The former is defined in torch/csrc/profiler/cuda.cpp so it's usually not available; the latter is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK.
* add test
* clean up
* guard for torchvision
Co-authored-by: Wil Kong <alpha0422@gmail.com>
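The "guard for torchvision" bullet presumably means the focal loss tests are skipped when torchvision (which provides a reference focal loss) is not installed. A minimal sketch of that guard, under that assumption:

```python
# Minimal sketch: skip the focal loss tests when torchvision is unavailable.
import unittest

try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False


@unittest.skipUnless(HAS_TORCHVISION, "torchvision is required for the reference focal loss")
class TestFocalLoss(unittest.TestCase):
    def test_placeholder(self):
        # Real tests would compare the fused CUDA focal loss against a
        # torchvision-based reference (e.g. torchvision.ops.sigmoid_focal_loss).
        self.assertTrue(HAS_TORCHVISION)
```
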
- 23 Mar, 2022 1 commit
Thor Johnsen authored

- 11 Mar, 2022 1 commit
Pruthvi Madugundu authored

- 27 Feb, 2022 1 commit
Masaki Kozuki authored

- 26 Feb, 2022 1 commit
Masaki Kozuki authored
* fuse grad accumulation w/ weight grad
* fp32 training path
* not using *args, **kwargs
* backward: moved the tensor dimension conversion
* move files to csrc/megatron
* fix fp32 path
* fix typo
* add to in order to select the correct custom extension
* fix typo
* comment on import guard
* update test: enable gradient_accumulation_fusion
* 86
* remove redundant call of `test_column_parallel_linear`
Co-authored-by: Sangkug Lym <slym@nvidia.com>
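For context on the first bullet, "fuse grad accumulation w/ weight grad" refers to writing the weight gradient directly into a persistent fp32 accumulation buffer during backward instead of materializing a separate grad tensor and adding it afterwards. A pure-PyTorch sketch of the unfused equivalent (the `main_grad` attribute name follows Megatron-LM convention; the fused path does this inside a CUDA kernel):

```python
# Illustrative sketch (not the fused CUDA kernel): accumulate dL/dW of a
# linear layer straight into weight.main_grad during the backward pass.
import torch


def linear_backward_with_fused_accumulation(grad_output, input, weight):
    # grad_output: (tokens, out_features), input: (tokens, in_features)
    with torch.no_grad():
        grad_input = grad_output @ weight               # dL/dx for the previous layer
        weight.main_grad.add_(grad_output.t() @ input)  # dL/dW accumulated in place
    return grad_input


weight = torch.nn.Parameter(torch.randn(4, 3))
weight.main_grad = torch.zeros(4, 3, dtype=torch.float32)  # persistent fp32 buffer
x = torch.randn(5, 3)
g = torch.randn(5, 4)
_ = linear_backward_with_fused_accumulation(g, x, weight)
```
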
- 10 Feb, 2022 1 commit
Masaki Kozuki authored

- 01 Feb, 2022 1 commit
ChongyuNVIDIA authored
* Add the permutation related support as the extension for the ASP lib.
* [Fix] Track the permutation sequence for the progressive channel swap strategy.
* Fix the corner case where one layer is not sparse but needs to apply permutation due to its siblings.
* Fix the deprecated functions in ASP unit tests.
* Fix the sparsity info typo in the ASP lib.
* [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
* Update the README.md with the identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into the ASP lib.
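A small sketch of the "identical random seed for all GPUs" point (the helper name is illustrative; ASP's own entry point may differ): every rank seeds its RNGs with the same constant so the permutation search explores the same candidates and produces identical results everywhere.

```python
# Sketch: seed all RNG sources identically on every rank/GPU, on purpose
# (unlike the usual per-rank seeding), so permutation search is reproducible
# and consistent across devices.
import random

import numpy as np
import torch


def set_identical_seed(seed: int = 1) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_identical_seed(1)
```
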
- 28 Jan, 2022 1 commit
Jithun Nair authored

- 19 Jan, 2022 1 commit
Masaki Kozuki authored

- 13 Jan, 2022 1 commit
Shintaro Iwasaki authored

- 16 Dec, 2021 1 commit
Masaki Kozuki authored

- 15 Dec, 2021 1 commit
Masaki Kozuki authored
* apply formatter & remove duplicate func def
* dry CUDA_HOME None check
* `--threads 4`
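"Dry CUDA_HOME None check" refers to deduplicating the repeated build-time check in setup.py; a sketch under that assumption (the helper name is made up), with `--threads 4` presumably being nvcc's parallel-compilation flag:

```python
# Sketch: factor the repeated "is a CUDA toolkit available at build time?"
# check into one helper instead of repeating it per extension flag.
from torch.utils.cpp_extension import CUDA_HOME


def require_cuda_home(ext_name: str) -> None:
    if CUDA_HOME is None:
        raise RuntimeError(
            f"Cannot build {ext_name}: CUDA_HOME is not set. "
            "Install a CUDA toolkit or drop the extension flag."
        )


# Example use while assembling ext_modules in setup.py:
# require_cuda_home("--fast_multihead_attn")
```
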
- 14 Dec, 2021 1 commit
Masaki Kozuki authored
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* clean up layer norm

- 09 Dec, 2021 2 commits
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
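A hedged sketch of the "sending param_group tensor state to device" fix: when optimizer hyperparameters are kept as tensors (so a fused kernel can read them without host syncs), they must live on the same device as the parameters they drive. The key names below are illustrative, not the optimizer's actual state layout.

```python
# Sketch: move any tensor-valued entries in each param_group onto the
# device of that group's parameters.
import torch


def move_group_state_to_device(param_groups):
    for group in param_groups:
        device = group["params"][0].device
        for key in ("step", "lr", "beta1", "beta2"):  # illustrative keys only
            value = group.get(key)
            if torch.is_tensor(value) and value.device != device:
                group[key] = value.to(device)
```
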
- 03 Dec, 2021 1 commit
hubertlu-tw authored

- 02 Dec, 2021 1 commit
Jithun Nair authored
* Use --cuda_ext flag to build all supported extensions
* Don't remove --cuda_ext since it'll be needed to build other extensions
* Need to clear all cmdline args so setup.py doesn't complain
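A sketch of the last two bullets, assuming the usual Apex-style setup.py pattern (flag handling simplified): custom options such as `--cuda_ext` are read first and then stripped from `sys.argv` so setuptools only sees arguments it understands.

```python
# Sketch of a setup.py that consumes a custom --cuda_ext flag before
# handing the remaining command line to setuptools.
import sys

from setuptools import setup

BUILD_ALL_CUDA_EXTS = "--cuda_ext" in sys.argv

ext_modules = []
if BUILD_ALL_CUDA_EXTS:
    # ... append every supported CUDAExtension here ...
    pass

# Strip the custom flag so setup() does not reject an unknown option.
sys.argv = [arg for arg in sys.argv if arg != "--cuda_ext"]

setup(
    name="example_package",
    version="0.1",
    ext_modules=ext_modules,
)
```
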
- 02 Nov, 2021 1 commit
Hubert Lu authored
Co-authored-by: Jeff Daily <jeff.daily@amd.com>

- 27 Oct, 2021 1 commit
Masaki Kozuki authored
* Persistent LayerNorm: Multi-CTA Rewrite
* autocast support
Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>

- 21 Oct, 2021 1 commit
Jeff Daily authored

- 19 Oct, 2021 2 commits
- 02 Oct, 2021 1 commit
Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

- 08 Sep, 2021 1 commit
Masaki Kozuki authored
- passing include directories to `CUDAExtension`'s `include_dirs` argument
- removing `-I/path/to/dir` arguments from `extra_compile_args`
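A sketch of the change described above (paths and the extension name are illustrative): include paths go into `CUDAExtension`'s `include_dirs` rather than as raw `-I` strings inside `extra_compile_args`.

```python
# Sketch: declare include directories via include_dirs instead of -I flags.
import os

from torch.utils.cpp_extension import CUDAExtension

this_dir = os.path.dirname(os.path.abspath(__file__))

ext = CUDAExtension(
    name="example_cuda_ext",
    sources=["csrc/example.cpp", "csrc/example_kernel.cu"],
    include_dirs=[os.path.join(this_dir, "csrc", "includes")],  # instead of "-I..."
    extra_compile_args={
        "cxx": ["-O3"],
        "nvcc": ["-O3", "--use_fast_math"],  # no -I arguments needed here any more
    },
)
```
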
- 07 Sep, 2021 1 commit
sarunyap authored
* Enable group batch norm (--bnp) on ROCm (only bn_group = 1). Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1). The multi-GPU case (bn_group > 1) will be revisited in the future. The main changes: 1) use MIOpen data structures/functions in HIP instead of CUDNN; 2) for the warp-level primitive code, ensure that the code operates on a 64-thread-wide warp instead of a 32-thread-wide one; 3) disable all the bn_group > 1 paths. Notes: 1) multi-stream is not tested; 2) we have not optimized for performance.
* Fix bnp hipification. Avoid calling hipify-perl in setup.py and rely on PyTorch's internal hipification mechanism.
* Make bnp data pointers contiguous. The contrib group batch norm implementation assumes that all input tensors are contiguous; when non-contiguous tensors are passed to the function, it gives a wrong result. This commit explicitly calls .contiguous() to make all input tensors contiguous before accessing them.
* Fix HIP lane id in bnp. Fix typo.
* Fix ReLU bitmask for HIP in bnp. The ReLU bitmask is derived by using the __ballot function, which returns a 64-bit value in HIP. This commit fixes the ReLU bitmask storage size and offsets on ROCm. This patch also fixes the kernel to set the ReLU bitmask to 1 when the data is less than or equal to zero (not only less than). Not doing so can cause a stability issue.
* Remove multiple-of-64 offset for HIP in bnp. The multiple-of-64 offset is not necessary.
* Use FP16 intermediate output to determine whether to rectify in bnp. Group batch norm takes FP16 tensors and produces FP16 output; however, all arithmetic operations are done in FP32, so intermediate outputs are in FP32. For the fusion kernels, ReLU checks the FP32 intermediate output to decide whether to rectify it. ReLU must rectify the intermediate output if it is less than or "equal" to zero. There is a chance that the intermediate FP32 output is very close to zero, and when it is converted to FP16 it becomes zero; in this case the output is not rectified when it should be. Since the output is not rectified in the forward pass, the gradient is not rectified in the backward pass. This can cause a stability issue. This patch can have a negative impact on the performance of group batch norm as we perform FP32-FP16 conversion multiple times.
* Disable dispatchX ParallelSums in HIP in bnp. dispatchX is not required for the bn_group = 1 case.
* Use traditional load/store for HIP in bnp. The built-in function has a high floating point rounding error, so we replace it with the traditional load/store. Doing so breaks the aligned-pointer property in the load/store functions, so we conservatively use traditional load/store for all memory access.
* Replace shfl_down with shfl_sync in parallel sums for HIP in bnp. This commit separates the HIP code from the CUDA code in parallel sums.
* Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp. Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is no longer needed.
* Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp.
* Add test for bnp. The test evaluates the correctness of batch norm, batch norm + ReLU, and batch norm + add + ReLU against reference implementations. For the forward activation output, we validate against PyTorch's implementation: the group batch norm activation output must be allclose with the PyTorch activation output for the test to pass. For the backward gradient output, we validate against a Python implementation. Due to the floating point rounding error in the batch norm implementation, the group batch norm gradient output might not be allclose with the Python implementation output when ReLU is being used, although the majority of the elements are very close to each other. Thus, we use a norm-difference threshold to determine whether the test passes or fails instead of allclose.
* Use the warp size variable rather than hard coding the warp size in bnp. Use C10_WARP_SIZE from c10/macros/Macros.h in the host functions and use warpSize in the device kernels instead of hard coding the warp size.
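A small sketch of the norm-difference criterion used by the bnp gradient test (the threshold value is illustrative): compare the relative L2 norm of the error rather than requiring elementwise allclose.

```python
# Sketch: a relative-norm check tolerates a few slightly-off elements caused
# by floating point rounding, unlike an elementwise allclose check.
import torch


def relative_norm_error(result: torch.Tensor, reference: torch.Tensor) -> float:
    ref = reference.float()
    return (result.float() - ref).norm().item() / ref.norm().item()


def check_gradient(result, reference, threshold=1e-3):
    err = relative_norm_error(result, reference)
    assert err < threshold, f"relative norm error {err:.3e} exceeds {threshold:.1e}"
```
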
- 01 Sep, 2021 3 commits
Jeff Daily authored
Burc Eryilmaz authored
* fuse norm into scale
* add fused norm into dlamb
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop
* fix typo causing syntax error
* add fused GEMM+gelu+GEMM module
* fix typo for workspace size
* update cublas check for 11600
* add tests for fused dense layer
* fix CUDA 10.x path
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
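A hedged usage sketch for the fused dense layers this commit adds; the import path, class names (`apex.fused_dense.FusedDense`, `FusedDenseGeluDense`), and constructor signatures are assumptions and may not match the actual API.

```python
# Hedged sketch: a fused GEMM+bias layer and a fused GEMM+GeLU+GEMM block,
# both running in FP16 on the GPU so the cublasLt epilogues can be used.
import torch

from apex.fused_dense import FusedDense, FusedDenseGeluDense  # assumed import path

batch, hidden, intermediate = 32, 1024, 4096

# Single fused GEMM + bias (fprop and bprop both fused).
dense = FusedDense(hidden, hidden).half().cuda()

# Fused GEMM + GeLU + GEMM, e.g. a transformer MLP block in one module.
mlp = FusedDenseGeluDense(hidden, intermediate, hidden).half().cuda()

x = torch.randn(batch, hidden, dtype=torch.half, device="cuda", requires_grad=True)
y = mlp(dense(x))
y.sum().backward()
```
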