1. 18 Sep, 2023 1 commit
  2. 12 Jun, 2023 1 commit
    • 1. Modified the readme · f8b650c8
      flyingdown authored
      2. Added the environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16r is used
      3. Added DCU version information
      
      Renamed the whl package
      
      readme: updated the installation steps
      f8b650c8
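
The commit above only names the switch; how APEX_ROCBLAS_GEMM_ALLOW_HALF is consumed is not shown. A minimal sketch of the usual opt-in pattern, with the accepted values and the default as assumptions:

```python
import os

# Hedged sketch: read an opt-in switch like APEX_ROCBLAS_GEMM_ALLOW_HALF.
# Treating "1"/"true"/"yes" as enabled and defaulting to off are
# assumptions, not the fork's documented behavior.
def rocblas_half_gemm_allowed() -> bool:
    value = os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0")
    return value.strip().lower() in ("1", "true", "yes")

if rocblas_half_gemm_allowed():
    print("fp16 rocBLAS GEMM path enabled")
```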
  3. 23 Apr, 2023 1 commit
    • Add FusedLARS optimizer (#109) · e519c1e3
      luise.chen authored
      * Add fused_lars optimizer
      
      * Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
      
      * Add flow of using nesterov in FusedLARS
      e519c1e3
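
The commit does not show FusedLARS's exact math or signature. As a reference, here is an unfused, single-tensor LARS step in the common trust-ratio form, including the nesterov branch the commit mentions; where the trust ratio and momentum are applied varies between implementations, so this is a sketch, not the fork's exact update:

```python
import torch

# Reference (unfused) LARS step for one parameter tensor. This shows the
# kind of per-layer update a fused kernel computes; it is a sketch, not
# the fork's exact FusedLARS math.
def lars_step(p, g, buf, lr, momentum=0.9, weight_decay=1e-4,
              trust_coefficient=0.001, nesterov=False):
    w_norm, g_norm = p.norm(), g.norm()
    trust = 1.0
    if w_norm > 0 and g_norm > 0:
        # Layer-wise trust ratio: ||w|| / (||g|| + wd * ||w||), scaled
        trust = trust_coefficient * w_norm / (g_norm + weight_decay * w_norm)
    update = (g + weight_decay * p) * trust
    buf.mul_(momentum).add_(update)                       # momentum buffer
    step = update + momentum * buf if nesterov else buf   # nesterov flow
    p.add_(step, alpha=-lr)

p = torch.randn(64, 64)
buf = torch.zeros_like(p)
lars_step(p, torch.randn_like(p), buf, lr=0.1, nesterov=True)
```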
  4. 23 Mar, 2023 1 commit
    • Add FusedLARS optimizer (#109) · 7a428776
      luise.chen authored
      * Add fused_lars optimizer
      
      * Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
      
      * Add flow of using nesterov in FusedLARS
      7a428776
  5. 11 Nov, 2022 1 commit
  6. 08 Nov, 2022 2 commits
  7. 21 Sep, 2022 1 commit
  8. 19 Sep, 2022 1 commit
    • Faster build (#95) · 89f5722c
      Hubert Lu authored
      * Remove redundant imports and enable ninja for the MHA extension
      
      * Remove redundant CUDAExtension imports
      89f5722c
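
For context on the ninja change: torch.utils.cpp_extension can drive compilation through ninja, which builds the source files in parallel. A generic sketch; the module and source names are placeholders, not apex's actual files:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="mha_demo",
    ext_modules=[
        CUDAExtension(
            name="mha_demo",
            sources=["mha.cpp", "mha_kernels.cu"],  # placeholder file names
        )
    ],
    # use_ninja=True lets ninja compile the sources in parallel
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=True)},
)
```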
  9. 08 Sep, 2022 1 commit
    • Enable --transducer extension for ROCm (#88) · ae5ca671
      Hubert Lu authored
      * Enable --transducer extension for ROCm
      
      * Enable --transducer unit tests for ROCm
      
      * Skip some failing tests in test_transducer_joint.py
      
      * Skip test_transducer_joint_pack for transducer extension
      
      * Keep transducer extension CUDA-compatible
      ae5ca671
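
The skipped-test bullets follow the usual pattern of gating individual tests on ROCm while leaving them enabled on CUDA. A sketch; the test body is a placeholder:

```python
import unittest
import torch

IS_ROCM = torch.version.hip is not None  # True on ROCm builds of PyTorch

class TestTransducerJoint(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "known failure on ROCm")
    def test_transducer_joint_pack(self):
        self.assertTrue(True)  # placeholder body

if __name__ == "__main__":
    unittest.main()
```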
  10. 07 Sep, 2022 1 commit
  11. 23 Aug, 2022 2 commits
  12. 22 Aug, 2022 1 commit
  13. 07 Jul, 2022 1 commit
  14. 31 May, 2022 1 commit
  15. 21 Apr, 2022 1 commit
  16. 19 Apr, 2022 1 commit
  17. 15 Apr, 2022 1 commit
    • Apex transformer (#77) · 27a47345
      Hubert Lu authored
      * Add setup_simple.py for debugging the compilation issue of scaled_masked_softmax_cuda
      
      * Comment out CUDA-specific implementations
      
      * Resolve filename collisions between *.cpp files containing to-be-hipified code and *.cu files
      27a47345
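
For reference, the kernel being debugged here, scaled masked softmax, computes a softmax over scaled inputs with masked positions suppressed. An eager-PyTorch equivalent; the shapes and mask semantics follow the common megatron-style kernel and are assumptions:

```python
import torch

def scaled_masked_softmax(x, mask, scale):
    # Scale, then push masked positions to -inf so they get zero probability.
    x = x * scale
    x = x.masked_fill(mask, float("-inf"))
    return torch.softmax(x, dim=-1)

x = torch.randn(2, 4, 8, 8)          # (batch, heads, queries, keys)
mask = torch.rand(2, 1, 8, 8) > 0.8  # True = masked out
print(scaled_masked_softmax(x, mask, scale=0.125).shape)
```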
  18. 13 Apr, 2022 1 commit
    • Cherry-picked the commit from upstream for faster --fast_multihead_attn build (#76) · 29b36315
      Hubert Lu authored
      
      
      * Faster `--fast_multihead_attn` build (#1245)
      
      * merge .so files
      
      * odr
      
      * fix build
      
      * update import
      
      * apply psf/black with max line length of 120
      
      * update
      
      * fix
      
      * update
      
      * build fixed again but undefined symbol again
      
      * fix 2, still layer norm grad is undefined
      
      * remove unused cpp files
      
      * without layer_norm.cuh, import works
      
      * import fast_multihead_attn works...
      
      but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit,
      preventing the shared objects from linking `HostApplyLayerNorm` and
      `HostLayerNormGradient`?
      
      * clean up layer norm
      
      * Fix some bugs
      Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
      29b36315
  19. 06 Apr, 2022 1 commit
    • Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74) · 5ecad142
      Hubert Lu authored
      
      * First attempt to make rocblas flag backward compatible
      
      * Fix some bugs
      
      * Fix some bugs
      
      * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
      
      * Add groupbn extension unit tests for ROCm
      
      * Fix some bugs
      5ecad142
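
The backward-compatibility pattern here is to gate use of the new rocblas flag on the PyTorch version, so older builds that lack it still compile and run. A Python-side sketch of that gating; the cutoff version is an assumption:

```python
import torch

# Parse major/minor from torch.__version__ (e.g. "1.10.0+rocm4.2" -> (1, 10)).
major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])

# Only request the fp16 alternate GEMM implementation when the running
# PyTorch is new enough to expose it; the (1, 10) cutoff is an assumption.
USE_FP16_ALT_IMPL = (major, minor) >= (1, 10)
print(f"fp16 alt impl enabled: {USE_FP16_ALT_IMPL}")
```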
  20. 05 Apr, 2022 2 commits
  21. 30 Mar, 2022 1 commit
  22. 25 Mar, 2022 1 commit
  23. 24 Mar, 2022 1 commit
    • Add CUDA Focal Loss Implementation (#1337) · 28f8539c
      Masaki Kozuki authored
      
      
      Take-over of #1097
      
      * Add fast CUDA focal loss implementation.
      
      * Enable fast math for CUDA focal loss.
      
      * Correct typo.
      
      * replace deprecated macros
      
      * TORCH_CUDA_CHECK -> AT_CUDA_CHECK
      
      The former is defined in torch/csrc/profiler/cuda.cpp, so it is not usually available.
      The latter, however, is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK.
      
      * add test
      
      * clean up
      
      * guard for torchvision
      Co-authored-by: Wil Kong <alpha0422@gmail.com>
      28f8539c
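
As a reference for what the fused kernel computes, here is sigmoid focal loss in eager PyTorch, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with RetinaNet's default alpha/gamma; the fused kernel's actual signature is not shown in the commit:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-element cross entropy, down-weighted for easy examples by (1-p_t)^gamma.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balance
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 80)
targets = (torch.rand(8, 80) > 0.9).float()
print(focal_loss(logits, targets))
```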
  24. 23 Mar, 2022 1 commit
  25. 11 Mar, 2022 1 commit
  26. 27 Feb, 2022 1 commit
  27. 26 Feb, 2022 1 commit
  28. 10 Feb, 2022 1 commit
  29. 01 Feb, 2022 1 commit
    • Add the permutation related support as the extension for asp lib. (#1194) · 89edb819
      ChongyuNVIDIA authored
      * Add the permutation related support as the extension for asp lib.
      
      * [Fix] Track the permutation sequence for the progressive channel swap strategy.
      
      * Fix the corner case where one layer is not sparse but still needs the permutation applied because of its siblings.
      
      * Fix the deprecated functions in ASP unit tests.
      
      * Fix the sparsity info typo in ASP lib.
      
      * [Enhancement] Set an identical random seed on all GPUs to ensure the permutation search generates the same results.
      
      * Update the README.md with identical random seed setting and NeurIPS info.
      
      * Integrate the Pybind11 enhancement of permutation search into ASP lib.
      89edb819
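
The identical-random-seed bullet, as a sketch: every rank seeds all of its RNGs with the same constant, so the permutation search is deterministic and identical across GPUs. The seed value is arbitrary:

```python
import random
import numpy as np
import torch

def set_identical_seed(seed: int = 42):
    # Same constant on every rank -> identical permutation-search results.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # CPU RNG
    torch.cuda.manual_seed_all(seed)   # all CUDA/ROCm device RNGs

set_identical_seed()
```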
  30. 28 Jan, 2022 1 commit
  31. 19 Jan, 2022 1 commit
  32. 13 Jan, 2022 1 commit
  33. 16 Dec, 2021 1 commit
  34. 15 Dec, 2021 1 commit
  35. 14 Dec, 2021 1 commit
    • Faster `--fast_multihead_attn` build (#1245) · 7ec8ed67
      Masaki Kozuki authored
      * merge .so files
      
      * odr
      
      * fix build
      
      * update import
      
      * apply psf/black with max line length of 120
      
      * update
      
      * fix
      
      * update
      
      * build fixed again but undefined symbol again
      
      * fix 2, still layer norm grad is undefined
      
      * remove unused cpp files
      
      * without layer_norm.cuh, import works
      
      * import fast_multihead_attn works...
      
      but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit,
      preventing the shared objects from linking `HostApplyLayerNorm` and
      `HostLayerNormGradient`?
      
      * clean up layer norm
      7ec8ed67
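
The "merge .so files" idea in #1245, sketched: compile all the attention variants into a single extension module (one shared object), so helpers like the layer-norm host functions are defined and linked exactly once, avoiding the ODR and undefined-symbol problems the bullets describe. Source file names are placeholders:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="fast_multihead_attn",
    ext_modules=[
        # One module built from all sources, instead of one .so per
        # attention variant; the file names below are placeholders.
        CUDAExtension(
            name="fast_multihead_attn",
            sources=[
                "multihead_attn_frontend.cpp",
                "self_multihead_attn_cuda.cu",
                "encdec_multihead_attn_cuda.cu",
            ],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```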
  36. 09 Dec, 2021 2 commits