1. 13 Feb, 2023 1 commit
    • Luise/gbn optimization (#105) · 56c283b6
      luise.chen authored
      * GroupBN: Reduced buffering to better hide computation in some loops of length OUTER_LOOPS
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for the BN and BN_relu kernels to improve resnet50 performance
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for the BN_add_relu kernels, giving ~10% end-to-end (E2E) improvement on resnet50
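The C_ELEMENTS_PER_CTA changes above set how many channels each CTA (thread block) covers, so the number of CTAs along the channel dimension follows by ceiling division. A minimal sketch of that tiling arithmetic (the helper name is hypothetical; only the constant mirrors the commit):

```python
def ctas_for_channels(num_channels: int, c_elements_per_cta: int = 64) -> int:
    """Ceiling division: how many CTAs are needed to cover all channels."""
    return (num_channels + c_elements_per_cta - 1) // c_elements_per_cta

# e.g. a 256-channel resnet50 layer needs 4 CTAs at 64 channels each
```

Smaller tiles mean more CTAs per layer, which can help fill the GPU on layers with few channels.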
  2. 25 Jan, 2023 1 commit
    • Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27
      aspanday authored
      * Updating BLOCK_SIZE to 1024.
      tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam; there appears to be a bug in that test that still needs to be resolved.
      For now, test_bfloat16 for Adam is skipped in the unittest.
      Ran 17 other tests and ALL of them pass!
      More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization
      
      This commit changes BLOCK_SIZE to 1024 ONLY for the different optimizers.
      The L2norm kernels (part of the LAMB optimizer algorithm) still keep BLOCK_SIZE=512; otherwise allclose fails.
      
      * Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 in Adam.
      Co-authored-by: aspanday <aspanday@amd.com>
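The @skipIfRocm decorator mentioned above is essentially a thin wrapper around unittest.skipIf keyed on a ROCm check. A minimal sketch (the IS_ROCM flag here is a stand-in for illustration; apex derives it from the torch build, and the test body is a placeholder):

```python
import unittest

# Stand-in for ROCm detection (normally derived from torch.version.hip)
IS_ROCM = False

def skipIfRocm(fn):
    """Skip the decorated test when running on a ROCm build."""
    return unittest.skipIf(IS_ROCM, "test doesn't currently work on ROCm")(fn)

class TestFusedAdam(unittest.TestCase):
    @skipIfRocm
    def test_bfloat16(self):
        self.assertTrue(True)  # placeholder body for illustration

# Run the suite programmatically; with IS_ROCM=True the test would be skipped
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFusedAdam)
result = unittest.TestResult()
suite.run(result)
```

With IS_ROCM set to True, the test lands in result.skipped instead of failing.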
  3. 20 Dec, 2022 1 commit
  4. 10 Dec, 2022 1 commit
  5. 09 Dec, 2022 2 commits
  6. 06 Dec, 2022 2 commits
  7. 21 Sep, 2022 1 commit
  8. 19 Sep, 2022 1 commit
    • Faster build (#95) · 89f5722c
      Hubert Lu authored
      * Remove redundant imports and enable ninja for the MHA extension
      
      * Remove redundant CUDAExtension imports
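Enabling ninja for a torch C++/CUDA extension goes through BuildExtension in setup.py, which parallelizes compilation for a faster build. A sketch of what this might look like (the module and source names here are illustrative, not apex's actual ones):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="fused_mha_example",  # illustrative name, not apex's
    ext_modules=[
        CUDAExtension(
            name="fused_mha_example",
            sources=["mha_kernel.cu"],  # illustrative source file
        )
    ],
    # use_ninja=True builds object files in parallel via ninja
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=True)},
)
```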
  9. 08 Sep, 2022 4 commits
  10. 07 Sep, 2022 2 commits
  11. 26 Aug, 2022 1 commit
  12. 23 Aug, 2022 2 commits
  13. 22 Aug, 2022 2 commits
  14. 15 Aug, 2022 1 commit
  15. 10 Aug, 2022 1 commit
  16. 09 Aug, 2022 7 commits
  17. 08 Aug, 2022 6 commits
  18. 05 Aug, 2022 1 commit
    • Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      
      
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
      Co-authored-by: eqy <eddiey@nvidia.com>
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
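The epsilon-placement doc fixes above concern where eps sits in the RMSNorm formula: it is added to the mean of squares inside the square root, y = x / sqrt(mean(x^2) + eps) * g. A plain-Python reference sketch of that formula (not apex's fused kernel):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm: eps is added to mean(x^2) inside the sqrt."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Unlike LayerNorm, there is no mean subtraction and no bias; only the root-mean-square scale and the learnable gain.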
  19. 29 Jul, 2022 3 commits