1. 08 Aug, 2022 2 commits
  2. 05 Aug, 2022 1 commit
    • Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      
      
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
      Co-authored-by: eqy <eddiey@nvidia.com>
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
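For reference, the RMSNorm ("T5LayerNorm") that the fused kernel implements, with epsilon added inside the square root as the doc fixes in #1285/#1295 describe, can be sketched in plain Python. This is a reference sketch of the math only, not the fused CUDA kernel:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm / "T5LayerNorm": unlike LayerNorm there is no mean
    # subtraction and no bias term; epsilon is added inside the
    # square root:
    #   y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

y = rms_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
```

With unit weights the output has (approximately) unit mean square, which is the normalization property the fused kernel preserves.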
  3. 31 May, 2022 1 commit
  4. 15 Apr, 2022 1 commit
    • Apex transformer (#77) · 27a47345
      Hubert Lu authored
      * Add setup_simple.py for debugging the compiling issue of scaled_masked_softmax_cuda
      
      * Comment out CUDA-specific implementations
      
      * Resolve filename collision of *cpp files with to-hipify code and *cu files
  5. 14 Apr, 2022 1 commit
    • Added support for memory format API (torch.channels_last) in GBN (#72) · dd584a59
      mahathis authored
      
      
      * Added support for memory format API (torch.channels_last) in GBN
      
      Group Batch Norm (GBN) is an NHWC operation: it assumes that the
      underlying memory format of an input tensor is NHWC. It originally
      did not support PyTorch's memory_format API.
      
      To support PyTorch's memory_format API, i.e., .to(memory_format=...) or
      .contiguous(memory_format=...), we add the torch_channels_last
      flag to indicate whether the workload adopts the PyTorch memory_format
      API by setting memory_format=torch.channels_last.  This flag allows GBN
      to handle memory formats of input tensors properly.
      
      An example to use memory_format in GBN:
      
      """
      from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
      
      GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
      
      """
      
      The cases that GBN handles are as follows:
      
      1. If torch_channels_last=True and the input tensor's
      memory_format is torch.channels_last, GBN generates a
      torch.channels_last output tensor.
      
      2. If torch_channels_last=True and the input tensor's
      memory_format is torch.contiguous_format, GBN converts the input
      tensor to torch.channels_last and generates a torch.channels_last
      output tensor.
      
      3. If torch_channels_last=False and the input tensor's
      memory_format is torch.contiguous_format, GBN generates a
      torch.contiguous_format output tensor.
      
      * Add GBN unit tests for channel_last memory format
      Co-authored-by: hubertlu-tw <hubertlu@amd.com>
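The three cases above amount to a small dispatch rule on the flag and the input's memory format. The sketch below is illustrative only (plain Python, not the actual BatchNorm2d_NHWC implementation; the function name and string encoding of memory formats are assumptions for the example):

```python
def gbn_output_format(torch_channels_last, input_format):
    """Illustrative dispatch for the three GBN cases described above.

    `input_format` is "channels_last" or "contiguous"; returns the
    memory format of GBN's output tensor.
    """
    if torch_channels_last:
        # Cases 1 and 2: output is channels_last; a contiguous input
        # is converted to channels_last internally first.
        return "channels_last"
    # Case 3: flag off, contiguous input -> contiguous output.
    return "contiguous"
```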
  6. 13 Apr, 2022 1 commit
    • Cherry-picked the commit from upstream for faster --fast_multihead_attn build (#76) · 29b36315
      Hubert Lu authored
      
      
      * Faster `--fast_multihead_attn` build (#1245)
      
      * merge .so files
      
      * odr
      
      * fix build
      
      * update import
      
      * apply psf/black with max line length of 120
      
      * update
      
      * fix
      
      * update
      
      * build fixed again but undefined symbol again
      
      * fix 2, still layer norm grad is undefined
      
      * remove unused cpp files
      
      * without layer_norm.cuh, import works
      
      * import fast_multihead_attn works...
      
      but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit
      that prevented the shared objects from linking `HostApplyLayerNorm`
      and `HostLayerNormGradient`?
      
      * clean up layer norm
      
      * Fix some bugs
      Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
  7. 06 Apr, 2022 1 commit
    • Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with... · 5ecad142
      Hubert Lu authored
      Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
      
      * First attempt to make rocblas flag backward compatible
      
      * Fix some bugs
      
      * Fix some bugs
      
      * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
      
      * Add groupbn extension unit tests for ROCm
      
      * Fix some bugs
  8. 23 Mar, 2022 1 commit
  9. 18 Mar, 2022 1 commit
  10. 11 Mar, 2022 1 commit
  11. 16 Feb, 2022 1 commit
  12. 28 Jan, 2022 1 commit
  13. 26 Jan, 2022 1 commit
  14. 25 Jan, 2022 2 commits
  15. 21 Jan, 2022 1 commit
  16. 14 Dec, 2021 3 commits
  17. 13 Dec, 2021 1 commit
  18. 09 Dec, 2021 5 commits
  19. 08 Dec, 2021 1 commit
  20. 06 Dec, 2021 2 commits
    • Replace THCudaCheck with C10_CUDA_CHECK · fec3141c
      Hubert Lu authored
    • remove THC headers/functions (#1192) · 2155dabf
      Masaki Kozuki authored
      Changes include
      - THC headers removal
      - TH macros replacement
      - fix some typo in comment
       Conflicts:
      	apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu
      	apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu
      	apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu
      	apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu
      	apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu
      	apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu
      	apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu
      	apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu
      	apex/contrib/csrc/multihead_attn/strided_batched_gemm.h
  21. 03 Dec, 2021 2 commits
  22. 02 Dec, 2021 4 commits
  23. 01 Dec, 2021 2 commits
  24. 29 Nov, 2021 1 commit
  25. 22 Nov, 2021 1 commit
  26. 19 Nov, 2021 1 commit