1. 01 Mar, 2023 1 commit
  2. 15 Feb, 2023 1 commit
    • Grid optimization - Chunk_Size optimization. (#104) · b047a1f1
      aspanday authored
      * Updating BLOCK_SIZE to 1024.
      tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 with Adam; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now.
      Ran the 17 other tests; all of them pass.
      More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
      This commit sets BLOCK_SIZE=1024 only for the optimizer kernels.
      L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise the allclose check fails.
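      A minimal CUDA sketch of the split described above (illustration only; the kernel and constant names here are hypothetical, not from the apex source):
      ```cuda
      #include <cuda_runtime.h>

      // Illustration only: optimizer update kernels launch with 1024
      // threads per block after this change, while the L2-norm
      // reduction kernels keep 512 threads per block.
      constexpr int OPT_BLOCK_SIZE    = 1024;  // fused Adam/SGD/LAMB updates
      constexpr int L2NORM_BLOCK_SIZE = 512;   // LAMB's L2-norm reductions

      __global__ void fused_update_chunk(float* p, const float* g, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] -= 1e-3f * g[i];  // stand-in for the real update math
      }

      void launch_update(float* p, const float* g, int n) {
        int blocks = (n + OPT_BLOCK_SIZE - 1) / OPT_BLOCK_SIZE;
        fused_update_chunk<<<blocks, OPT_BLOCK_SIZE>>>(p, g, n);
      }
      ```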
      
      * Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 in Adam.
      
      * Updating chunk_size to 256*32 (8K); it was previously 2048*32 (64K).
      In addition, updating depth_to_max_blocks to 2560 (8x the previous 320).
      The observed performance improvement is up to 1.4x for large element counts, up to 5.2x for moderate element counts, and up to 1.44x for small element counts.
      This change only affects the optimizers, specifically when multi_tensor_apply is enabled by installing apex with the --cuda_ext extension.
      The full set of performance numbers, along with a comparison against Torch, is captured here:
      https://amdcloud.sharepoint.com//r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8
      See sheet chunk_opt.
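      Roughly how these two values feed into multi_tensor_apply's grid sizing (a sketch assuming a simple one-block-per-chunk mapping; grid_size is a hypothetical helper, not apex code):
      ```cuda
      #include <cstdio>

      // Each tensor is cut into chunk_size-element chunks, one block
      // per chunk, capped by depth_to_max_blocks. Values below are the
      // ones quoted in the commit.
      long long grid_size(long long numel, int chunk_size, int max_blocks) {
        long long chunks = (numel + chunk_size - 1) / chunk_size;
        return chunks < max_blocks ? chunks : max_blocks;
      }

      int main() {
        const long long numel = 1LL << 24;  // example: 16M-element tensor
        std::printf("old: %lld blocks\n", grid_size(numel, 2048 * 32, 320));   // 256
        std::printf("new: %lld blocks\n", grid_size(numel, 256 * 32, 2560));   // 2048
        return 0;
      }
      ```
      With the old settings the grid tops out well below what a wide GPU can keep busy; the smaller chunks launch many more blocks, which is where the speedups quoted above come from.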
      
      * Updating all files related to L2norm, since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits.
      Changes in chunk_size appear to affect the reduction kernels, so this commit keeps the unoptimized configuration for L2norm while applying the optimizations to all other kernels used by the optimizers.
      The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, along with multi_tensor_apply_base.cuh, to be used specifically by the l2norm kernels.
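      Why chunk_size affects a reduction at all: floating-point addition is not associative, so regrouping the per-chunk partial sums shifts the rounding enough to trip a tight allclose. A minimal host-side sketch (chunked_sumsq is hypothetical, for illustration only):
      ```cuda
      #include <cstdio>

      // Sum of squares computed per chunk, then combined, mimicking
      // how a chunked reduction groups its partial sums.
      float chunked_sumsq(const float* x, int n, int chunk) {
        float total = 0.f;
        for (int c = 0; c < n; c += chunk) {
          float partial = 0.f;  // per-chunk partial sum
          for (int i = c; i < c + chunk && i < n; ++i) partial += x[i] * x[i];
          total += partial;     // combine partials
        }
        return total;
      }

      int main() {
        static float x[1 << 16];
        for (int i = 0; i < (1 << 16); ++i) x[i] = 1e-3f * (i % 97);
        // The two groupings give slightly different float results.
        std::printf("8K chunks : %.9f\n", chunked_sumsq(x, 1 << 16, 8192));
        std::printf("64K chunks: %.9f\n", chunked_sumsq(x, 1 << 16, 65536));
        return 0;
      }
      ```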
      
      ---------
      Co-authored-by: aspanday <aspanday@amd.com>
  3. 13 Feb, 2023 1 commit
    • Luise/gbn optimization (#105) · 56c283b6
      luise.chen authored
      * GroupBN: Reduced buffering to better hide calculations in some loops of length OUTER_LOOPS
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
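      A rough sketch of what C_ELEMENTS_PER_CTA controls, assuming it is the slice of channels each CTA reduces over in the NHWC kernels (illustration only, not the groupbn source):
      ```cuda
      #include <cstdio>

      // Illustration only: if each CTA owns C_ELEMENTS_PER_CTA channels
      // of an NHWC tensor and reduces over N*H*W for them, a smaller
      // value spreads a layer's channels over more CTAs.
      constexpr int C_ELEMENTS_PER_CTA = 64;  // value chosen in this commit

      int main() {
        const int C = 256;  // e.g. one resnet50 stage's channel count
        int ctas = (C + C_ELEMENTS_PER_CTA - 1) / C_ELEMENTS_PER_CTA;
        std::printf("CTAs along the channel dim: %d\n", ctas);
        return 0;
      }
      ```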
  4. 25 Jan, 2023 1 commit
    • Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27
      aspanday authored
      * Updating BLOCK_SIZE to 1024.
      tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 with Adam; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now.
      Ran the 17 other tests; all of them pass.
      More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
      This commit sets BLOCK_SIZE=1024 only for the optimizer kernels.
      L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise the allclose check fails.
      
      * Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 in Adam.
      Co-authored-by: aspanday <aspanday@amd.com>
  5. 20 Dec, 2022 1 commit
  6. 10 Dec, 2022 1 commit
  7. 09 Dec, 2022 2 commits
  8. 06 Dec, 2022 2 commits
  9. 21 Sep, 2022 1 commit
  10. 19 Sep, 2022 1 commit
    • Faster build (#95) · 89f5722c
      Hubert Lu authored
      * Remove redundant imports and enable ninja for the MHA extension
      
      * Remove redundant CUDAExtension imports
  11. 08 Sep, 2022 4 commits
  12. 07 Sep, 2022 2 commits
  13. 26 Aug, 2022 1 commit
  14. 23 Aug, 2022 2 commits
  15. 22 Aug, 2022 2 commits
  16. 15 Aug, 2022 1 commit
  17. 10 Aug, 2022 1 commit
  18. 09 Aug, 2022 7 commits
  19. 08 Aug, 2022 6 commits
  20. 05 Aug, 2022 1 commit
    • Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
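      For reference, the formula those doc commits describe, with epsilon added inside the root (the standard RMSNorm form; a reconstruction, not a quote from the generated docs):
      ```latex
      % RMSNorm with learnable gain g and epsilon inside the square root:
      \[
        \mathrm{RMSNorm}(x)_i \;=\;
          \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^{2} \;+\; \epsilon}}
          \; g_i
      \]
      ```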
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
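      A minimal sketch of that last change (both calls are real PyTorch C++ APIs; everything around them is elided):
      ```cuda
      #include <ATen/cuda/CUDAContext.h>

      int pick_warp_size() {
        // before: at::cuda::getCurrentDeviceProperties()->warpSize;
        // after: the dedicated helper, which reports 64 on ROCm
        // (wavefront width) and 32 on CUDA.
        return at::cuda::warp_size();
      }
      ```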
      Co-authored-by: eqy <eddiey@nvidia.com>
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
  21. 29 Jul, 2022 1 commit