- 08 Aug, 2022 2 commits
-
-
hubertlu-tw authored
-
Hubert Lu authored
* Skip the failing unit tests from the FusedRMSNorm PR * Update test_lamb.py Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
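For reference, a minimal sketch of how such a failing test can be skipped on ROCm builds; the test class and reason string are illustrative, not the exact ones touched in this commit:

```python
# Hypothetical sketch: skip a known-failing test on ROCm builds of PyTorch.
import unittest
import torch

IS_ROCM = torch.version.hip is not None  # True when PyTorch was built for ROCm

class TestFusedRMSNorm(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "Skipped on ROCm until the FusedRMSNorm failure is resolved")
    def test_autocast_fused_rms_norm(self):
        ...
```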
-
- 05 Aug, 2022 1 commit
-
-
Hubert Lu authored
* FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) * FusedRMSNorm based on FusedLayerNorm * refactor duplicated kernels * delete comments * delete comments * cleanup * cleanup * cleanup, fixed clobbering forward_affine_mixed_dtypes * fix pybind naming and add MixedFused test * undo skipping * check elementwise_affine * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py Oof, nice catch, thanks Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> * fix and generate docs for FusedRMSNorm (#1285) * [FusedRMSNorm doc] document where epsilon is added (#1295) * [FusedRMSNorm doc] add epsilon to formula * correct * better wording * Fix some bugs * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs * Fix NaN issues in FusedRMSNorm * Update test_fused_layer_norm.py * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize Co-authored-by:
eqy <eddiey@nvidia.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Stas Bekman <stas00@users.noreply.github.com>
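The doc change above pins down where epsilon enters the RMSNorm formula. A plain-PyTorch reference sketch of the T5-style RMSNorm (not the fused kernel), with epsilon added under the square root:

```python
import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # y = x / sqrt(mean(x^2) + eps) * weight  -- epsilon sits inside the square root
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```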
-
- 31 May, 2022 1 commit
-
-
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compatible with the new naming * Use BACKWARD_PASS_GUARD_CLASS to avoid a lengthy if-statement
-
- 15 Apr, 2022 1 commit
-
-
Hubert Lu authored
* Add setup_simple.py for debugging the compilation issue of scaled_masked_softmax_cuda * Comment out CUDA-specific implementations * Resolve filename collisions between *.cpp files containing to-be-hipified code and *.cu files
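A minimal sketch of what a stripped-down setup_simple.py for isolating one extension's build could look like; the source paths and flags are assumptions, not the actual file contents:

```python
# Hypothetical stripped-down setup script for debugging a single extension build.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="scaled_masked_softmax_cuda",
    ext_modules=[
        CUDAExtension(
            name="scaled_masked_softmax_cuda",
            sources=[
                "csrc/megatron/scaled_masked_softmax.cpp",      # assumed paths
                "csrc/megatron/scaled_masked_softmax_cuda.cu",
            ],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```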
-
- 14 Apr, 2022 1 commit
-
-
mahathis authored
* Added support for the memory format API (torch.channels_last) in GBN.
Group Batch Norm (GBN) is an NHWC operation: it assumes that the underlying memory format of an input tensor is NHWC, and it originally did not support PyTorch's memory_format API. To support PyTorch's memory_format API, i.e., .to(memory_format=...) or .contiguous(memory_format=...), we add the torch_channels_last flag to indicate whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag allows GBN to handle the memory formats of input tensors properly.
An example of using memory_format in GBN:
"""
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
"""
The cases that GBN handles are as follows:
1. torch_channels_last=True and the input tensor's memory_format=torch.channels_last: GBN generates a torch.channels_last output tensor.
2. torch_channels_last=True and the input tensor's memory_format=torch.contiguous_format: GBN converts the input tensor to torch.channels_last and generates a torch.channels_last output tensor.
3. torch_channels_last=False and the input tensor's memory_format=torch.contiguous_format: GBN generates a torch.contiguous_format output tensor.
* Add GBN unit tests for the channels_last memory format Co-authored-by: hubertlu-tw <hubertlu@amd.com>
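A slightly fuller usage sketch of case 1 above, assuming the import path and constructor signature from the commit message (tensor shapes and dtype are illustrative):

```python
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

# GBN configured to consume and produce tensors in PyTorch's channels_last memory format.
gbn = BatchNorm2d_NHWC(64, fuse_relu=True, bn_group=1, torch_channels_last=True).cuda()

x = torch.randn(8, 64, 32, 32, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)   # input already channels_last (case 1)

y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)
```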
-
- 13 Apr, 2022 1 commit
-
-
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245) * merge .so files * odr * fix build * update import * apply psf/black with max line length of 120 * update * fix * update * build fixed again but undefined symbol again * fix 2, still layer norm grad is undefined * remove unused cpp files * without layer_norm.cuh, import works * `import fast_multihead_attn` works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that prevented the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`? * clean up layer norm * Fix some bugs Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
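A hedged sketch of the general idea behind merging the separate shared objects: compile all of the multihead-attn sources into a single CUDAExtension so only one module, fast_multihead_attn, has to be imported. The file list is illustrative, not the exact set used in the commit:

```python
# Illustrative only: build the fast_multihead_attn sources as one extension module.
from torch.utils.cpp_extension import CUDAExtension

fast_multihead_attn_ext = CUDAExtension(
    name="fast_multihead_attn",
    sources=[
        "apex/contrib/csrc/multihead_attn/multihead_attn_frontend.cpp",  # assumed file names
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu",
        "apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu",
    ],
    extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
)
```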
-
- 06 Apr, 2022 1 commit
-
-
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74) * First attempt to make rocblas flag backward compatible * Fix some bugs * Fix some bugs * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions * Add groupbn extension unit tests for ROCm * Fix some bugs
-
- 23 Mar, 2022 1 commit
-
-
Hubert Lu authored
* Add rocblas_alt_impl flag in MLP * Refactor rocblas_alt_impl implementation and only use it for backprop
-
- 18 Mar, 2022 1 commit
-
-
athitten authored
* Add missing flags arg in gemm_switch_fp32accum call * Add rocblas_alt_impl flag in MHA: add the rocblas_alt_impl flag for all bwd gemms in the MHA module * Use ifdef for rocblas_gemm_flags_fp16_alt_impl to target various AMD hardware Co-authored-by: hubertlu-tw <hubertlu@amd.com>
-
- 11 Mar, 2022 1 commit
-
-
Pruthvi Madugundu authored
-
- 16 Feb, 2022 1 commit
-
-
hubertlu-tw authored
-
- 28 Jan, 2022 1 commit
-
-
Jithun Nair authored
-
- 26 Jan, 2022 1 commit
-
-
Jithun Nair authored
-
- 25 Jan, 2022 2 commits
- 21 Jan, 2022 1 commit
-
-
athitten authored
Remove an unnecessary debug print statement.
-
- 14 Dec, 2021 3 commits
-
-
Jithun Nair authored
IFU-master-2021-12-08
-
Hubert Lu authored
-
Hubert Lu authored
* Skip failing unit tests * Modify the test skipping messages
-
- 13 Dec, 2021 1 commit
-
-
Hubert Lu authored
-
- 09 Dec, 2021 5 commits
-
-
Hubert Lu authored
-
Masaki Kozuki authored
* pass `self.mask_additive` * clang-format * removing THCState
-
Kevin Stephano authored
* Add fused mixed precision lamb optimizer. * Fix device usage in constructor. * Fix sending param_group tensor state to device. * Remove unneeded device set.
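A hedged usage sketch, assuming the optimizer is exposed as apex.optimizers.FusedMixedPrecisionLamb (import path and defaults may differ):

```python
import torch
from apex.optimizers import FusedMixedPrecisionLamb  # assumed import path

model = torch.nn.Linear(1024, 1024).half().cuda()
optimizer = FusedMixedPrecisionLamb(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(32, 1024, device="cuda", dtype=torch.half)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```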
-
Hubert Lu authored
-
hubertlu-tw authored
-
- 08 Dec, 2021 1 commit
-
-
Jithun Nair authored
IFU-2021-10-15 (+ remove redundant defines + C10_CUDA_CHECK)
-
- 06 Dec, 2021 2 commits
-
-
Hubert Lu authored
-
Masaki Kozuki authored
Changes include:
- THC headers removal
- TH macros replacement
- fix some typos in comments
Conflicts:
  apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu
  apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu
  apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu
  apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu
  apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu
  apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu
  apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu
  apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu
  apex/contrib/csrc/multihead_attn/strided_batched_gemm.h
-
- 03 Dec, 2021 2 commits
-
-
hubertlu-tw authored
-
hubertlu-tw authored
-
- 02 Dec, 2021 4 commits
-
-
Jithun Nair authored
* Use --cuda_ext flag to build all supported extensions * Don't remove --cuda_ext since it'll be needed to build other extensions * Need to clear all cmdline args so setup.py doesn't complain
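A minimal sketch of the pattern behind the last bullet: custom flags such as --cuda_ext are consumed from sys.argv before setuptools parses the remaining arguments (flag name follows apex's setup.py; the surrounding logic is simplified):

```python
import sys

ext_modules = []

# Simplified sketch: consume the custom flag so setup.py doesn't complain about unknown options.
if "--cuda_ext" in sys.argv:
    sys.argv.remove("--cuda_ext")
    # ... append the CUDA extensions to ext_modules here ...
```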
-
Hubert Lu authored
Add more unit tests for both distributed and extensions
-
hubertlu-tw authored
-
Hubert Lu authored
-
- 01 Dec, 2021 2 commits
- 29 Nov, 2021 1 commit
-
-
X Wang authored
-
- 22 Nov, 2021 1 commit
-
-
Hubert Lu authored
Change python3.6 to python
-
- 19 Nov, 2021 1 commit
-
-
Hubert Lu authored
-