1. 13 Apr, 2022 1 commit
    • Cherry-picked the commit from upstream for faster --fast_multihead_attn build (#76) · 29b36315
      Hubert Lu authored
      
      
      * Faster `--fast_multihead_attn` build (#1245)
      
      * merge .so files
      
      * odr
      
      * fix build
      
      * update import
      
      * apply psf/black with max line length of 120
      
      * update
      
      * fix
      
      * update
      
      * build fixed again but undefined symbol again
      
      * fix 2, still layer norm grad is undefined
      
      * remove unused cpp files
      
      * without layer_norm.cuh, import works
      
      * import fast_multihead_attn works...
      
      but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit
      that kept the shared objects from linking `HostApplyLayerNorm` and
      `HostLayerNormGradient`?
      
      * clean up layer norm
      
      * Fix some bugs
      Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
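      
      The speedup above comes from merging the per-kernel shared objects into a
      single extension, so common helpers such as the layer norm functions are
      compiled and linked exactly once.  A minimal setup.py sketch of that
      pattern (the source file names here are illustrative, not the actual apex
      file list):
      
      ```python
      from setuptools import setup
      from torch.utils.cpp_extension import BuildExtension, CUDAExtension
      
      # One extension with all kernel sources instead of one .so per kernel,
      # so shared helpers (e.g. the layer norm functions) are defined exactly
      # once in the final shared object.
      setup(
          name="fast_multihead_attn",
          ext_modules=[
              CUDAExtension(
                  name="fast_multihead_attn",
                  sources=[
                      "fast_multihead_attn.cpp",       # single pybind11 entry point
                      "self_multihead_attn_cuda.cu",   # illustrative kernel sources
                      "encdec_multihead_attn_cuda.cu",
                  ],
                  extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
              )
          ],
          cmdclass={"build_ext": BuildExtension},
      )
      ```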
  2. 06 Apr, 2022 1 commit
    • Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with... · 5ecad142
      Hubert Lu authored
      Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
      
      * First attempt to make rocblas flag backward compatible
      
      * Fix some bugs
      
      * Fix some bugs
      
      * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
      
      * Add groupbn extension unit tests for ROCm
      
      * Fix some bugs
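      
      A common way to keep such a flag backward compatible is to detect the
      PyTorch version in setup.py and gate the new code path behind a
      compile-time macro.  A minimal sketch of that pattern, assuming a
      hypothetical ROCBLAS_ALT_IMPL_AVAILABLE macro and an illustrative version
      cutoff (neither is taken from the actual commit):
      
      ```python
      import torch
      
      version_dependent_macros = []
      
      # Parse the installed PyTorch version; suffixes such as "a0+git..." on
      # the patch component are ignored because only major.minor is compared.
      major, minor = (int(x) for x in torch.__version__.split(".")[:2])
      
      # Older PyTorch ROCm builds do not expose rocblas_gemm_flags_fp16_alt_impl,
      # so only define the macro when the version is new enough.
      if (major, minor) >= (1, 10):
          version_dependent_macros.append("-DROCBLAS_ALT_IMPL_AVAILABLE")
      
      # version_dependent_macros is then appended to extra_compile_args of the
      # MHA/MLP extensions, and the C++ sources wrap the flag usage in
      # `#ifdef ROCBLAS_ALT_IMPL_AVAILABLE ... #else ... #endif`.
      ```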
  3. 11 Mar, 2022 1 commit
  4. 28 Jan, 2022 1 commit
  5. 09 Dec, 2021 1 commit
  6. 03 Dec, 2021 1 commit
  7. 02 Dec, 2021 1 commit
  8. 02 Nov, 2021 1 commit
  9. 27 Oct, 2021 1 commit
  10. 21 Oct, 2021 1 commit
  11. 19 Oct, 2021 2 commits
  12. 02 Oct, 2021 1 commit
  13. 08 Sep, 2021 1 commit
    • enable ninja (#1164) · 9ce0a10f
      Masaki Kozuki authored
      - passing include directories to `CUDAExtension`'s `include_dirs` argument
      - removing `-I/path/to/dir` arguments from `extra_compile_args`
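      
      The change amounts to declaring include paths through `CUDAExtension`'s
      `include_dirs` argument rather than hand-written `-I` flags in
      `extra_compile_args`.  A minimal sketch (module and path names are
      illustrative):
      
      ```python
      import os
      from torch.utils.cpp_extension import CUDAExtension
      
      this_dir = os.path.dirname(os.path.abspath(__file__))
      
      ext = CUDAExtension(
          name="fused_example",                            # illustrative name
          sources=["csrc/fused_example.cpp", "csrc/fused_example_kernel.cu"],
          include_dirs=[os.path.join(this_dir, "csrc", "include")],
          extra_compile_args={
              "cxx": ["-O3"],
              # previously something like:
              # "nvcc": ["-O3", "-I" + os.path.join(this_dir, "csrc/include")]
              "nvcc": ["-O3"],
          },
      )
      ```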
  14. 07 Sep, 2021 1 commit
    • Enable group batch norm (--bnp) on ROCm (only bn_group = 1) (#51) · e57c84e0
      sarunyap authored
      * Enable group batch norm (--bnp) on ROCm (only bn_group = 1)
      
      Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1).
      The multi-GPU case (bn_group > 1) will be revisited in the future.
      
      The following are the main changes:
      
      1) Use MIOpen data structures/functions in HIP instead of CUDNN
      2) For the warp-level primitive code, ensure that the code operates on
         a 64-thread-wide warp instead of a 32-thread-wide one
      3) Disable all the bn_group > 1 paths
      
      Notes:
      
      1) Multi-stream is not tested.
      2) We have not optimized for performance.
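      
      For orientation, the fused NHWC path enabled here computes what the
      following plain PyTorch reference does for the batch norm + add + ReLU
      case with bn_group = 1; this is only a semantic sketch, not the contrib
      API:
      
      ```python
      import torch
      import torch.nn.functional as F
      
      def ref_bn_add_relu(x, z, weight, bias, running_mean, running_var, eps=1e-5):
          """Reference for fused NHWC batch norm + residual add + ReLU (bn_group = 1).
      
          x, z: FP16 tensors of shape [N, H, W, C]; weight, bias, running_mean,
          running_var: FP32 tensors of shape [C].  All math is done in FP32.
          """
          y = F.batch_norm(
              x.float().permute(0, 3, 1, 2),      # NHWC -> NCHW for the reference op
              running_mean, running_var, weight, bias,
              training=True, eps=eps,
          ).permute(0, 2, 3, 1)                   # back to NHWC
          y = y + z.float()                       # residual add
          return F.relu(y).half()                 # ReLU, FP16 output
      ```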
      
      * Fix bnp hipification
      
      Avoid calling hipify-perl in setup.py and rely on PyTorch's internal
      hipification mechanism.
      
      * Make bnp data pointers contiguous
      
      The contrib group batch norm implementation assumes that all input
      tensors are contiguous.  When non-contiguous tensors are passed to the
      function, it gives a wrong result.  This commit explicitly calls
      .contiguous() to make all input tensors contiguous before accessing
      them.
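      
      A minimal sketch of the wrapper-side fix (the function and argument
      names are illustrative, not the actual bnp interface):
      
      ```python
      def bnp_forward(x, scale, bias, ext_forward):
          # The kernels index raw data pointers assuming densely packed
          # storage, so force contiguity before calling into the extension.
          x, scale, bias = x.contiguous(), scale.contiguous(), bias.contiguous()
          return ext_forward(x, scale, bias)
      ```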
      
      * Fix HIP lane id in bnp
      
      Fix typo
      
      * Fix ReLU bitmask for HIP in bnp
      
      The ReLU bitmask is derived by using the __ballot function which returns
      a 64-bit value in HIP.  This commit fixes the ReLU bitmask storage size
      and offsets on ROCm.
      
      This patch also fixes the kernel to set ReLU bitmask to 1 when the data
      is less than or equal to zero (not only less than).  Not doing so can
      cause a stability issue.
      
      * Remove multiple of 64 offset for HIP in bnp
      
      The multiple of 64 offset is not necessary.
      
      * Use FP16 intermediate output to determine whether to rectify in bnp
      
      Group batch norm takes FP16 tensors and produces FP16 output; however,
      all arithmetic operations are done in FP32, so the intermediate outputs
      are in FP32.  In the fusion kernels, ReLU inspects the FP32 intermediate
      output to decide whether to rectify it, and it must rectify the output
      if it is less than or "equal" to zero.  There is a chance that the
      intermediate FP32 output is very close to zero and becomes exactly zero
      when converted to FP16.  In this case, the output is not rectified when
      it should be.  Since the output is not rectified in the forward pass,
      the gradient is not rectified in the backward pass.  This can cause a
      stability issue.
      
      This patch can have a negative impact on the performance of group batch
      norm as we perform FP32-FP16 conversion multiple times.
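      
      The corner case is easy to reproduce in plain PyTorch: a value can be
      positive in FP32 yet flush to zero in FP16, so a rectify decision based
      on the FP32 value disagrees with the FP16 output that the backward pass
      sees.
      
      ```python
      import torch
      
      x = torch.tensor([1e-9], dtype=torch.float32)  # tiny positive FP32 intermediate
      print(x > 0)          # tensor([True])  -> FP32-based ReLU would not rectify
      print(x.half())       # tensor([0.], dtype=torch.float16) -> stored output is zero
      print(x.half() <= 0)  # tensor([True]) -> FP16-based decision rectifies,
                            #                   matching what the backward pass sees
      ```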
      
      * Disable dispatchX ParallelSums in HIP in bnp
      
      dispatchX is not required for the bn_group = 1 case.
      
      * Use traditional load/store for HIP in bnp
      
      The built-in function has a high floating point rounding error.  Thus,
      we replace it with the traditional load/store.  Doing so breaks the
      aligned pointer property in the load/store functions.  We conservatively
      use traditional load/store for all memory access.
      
      * Replace shfl_down with shfl_sync in parallel sums for HIP in bnp
      
      This commit separates the HIP code from the CUDA code in parallel sums.
      
      * Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp
      
      Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is
      no longer needed.
      
      * Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp
      
      * Add test for bnp
      
      The test evaluates correctness of batch norm, batch norm + ReLU, and
      batch norm + add + ReLU against the reference implementation.
      
      For the forward activation output, we validate it against PyTorch's
      implementation.  The group batch norm activation output must be allclose
      with the PyTorch activation output for the test to pass.
      
      For the backward gradient output, we validate it against the Python
      implementation.  Due to the floating point rounding error in the batch
      norm implementation, the group batch norm gradient output might not be
      allclose with the Python implementation's output when ReLU is used,
      although the majority of the elements are very close to each other.
      Thus, we use a norm-difference threshold instead of allclose to
      determine whether the test passes or fails.
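      
      A sketch of such a check, with an illustrative threshold rather than the
      exact value used by the test:
      
      ```python
      import torch
      
      def grads_close(out_grad, ref_grad, threshold=1e-3):
          # Relative L2 norm difference instead of elementwise torch.allclose:
          # a few elements perturbed by rounding around the ReLU boundary are
          # tolerated, while systematic errors still fail the check.
          diff = (out_grad.float() - ref_grad.float()).norm()
          return (diff / ref_grad.float().norm().clamp_min(1e-12)).item() < threshold
      ```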
      
      * Use the warp size variable rather than hard coding the warp size in bnp
      
      Use C10_WARP_SIZE from c10/macros/Macros.h in the host functions and use
      warpSize in the device kernels instead of hard coding the warp size.
  15. 01 Sep, 2021 3 commits
  16. 31 Aug, 2021 1 commit
  17. 17 Jul, 2021 2 commits
    • Added more fusion and vectorized kernel for transducer (#1125) · 0c2c6eea
      Nan Zheng authored
      * Added support for fused ReLU and dropout into transducer joint
      
      * Reorganized code selection path in transducer joint fwd
      * Added support for fused ReLU+dropout into transducer joint
      
      * Vectorize transducer loss backward with fused softmax (#3)
      
      * Nanz/transducer loss (#4)
      
      * Vectorize transducer loss backward with fused softmax
      
      * Added a predicate to avoid potential IMA
      
      * Nanz/transducer loss (#5)
      
      * Vectorize transducer loss backward with fused softmax
      
      * Added a predicate to avoid potential IMA
      
      * Added more predicates to avoid IMAs
      
      * Updated documentation for newly added features.
      
      * Fixed an error in transducer.py
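      
      For context, the fused ReLU + dropout joint computes in one kernel what
      this unfused reference does; the tensor names and shapes follow the
      usual RNN-T convention and are illustrative:
      
      ```python
      import torch
      import torch.nn.functional as F
      
      def ref_transducer_joint(f, g, dropout_prob=0.1, training=True):
          """Unfused reference for the transducer joint with ReLU + dropout.
      
          f: encoder output   [B, T, H]
          g: predictor output [B, U, H]
          returns: joint output [B, T, U, H]
          """
          h = f.unsqueeze(2) + g.unsqueeze(1)  # broadcast add over (T, U)
          h = F.relu(h)                        # ReLU fused into the custom kernel
          return F.dropout(h, p=dropout_prob, training=training)  # fused dropout
      ```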
    • Adds small-batch kernels (#1126) · ed719967
      yjk21 authored
  18. 25 Jun, 2021 1 commit
  19. 17 Apr, 2021 1 commit
  20. 16 Apr, 2021 1 commit
  21. 24 Mar, 2021 1 commit
    • Initial check-in of the transducer extensions (#1069) · d86d1b09
      Nan Zheng authored
      * Initial check-in of the transducer extension.
      
      * Added more comments to help explain the code
      
      * Corrected minor typos
      
      * 1. Renamed variable in tests to match the extension
      2. Disabled ninja build option
  22. 23 Feb, 2021 1 commit
  23. 21 Jan, 2021 1 commit
  24. 18 Jan, 2021 1 commit
  25. 16 Dec, 2020 1 commit
  26. 15 Dec, 2020 3 commits
  27. 10 Dec, 2020 1 commit
  28. 09 Dec, 2020 2 commits
  29. 01 Dec, 2020 1 commit
  30. 18 Aug, 2020 1 commit
  31. 17 Aug, 2020 1 commit
  32. 10 Aug, 2020 1 commit
  33. 05 Aug, 2020 1 commit