- 03 Jun, 2025 1 commit
fengzch-das authored
- 30 May, 2025 1 commit
fengzch-das authored
- 29 May, 2025 1 commit
fengzch-das authored
- 16 May, 2025 2 commits
limm authored
- 12 May, 2025 2 commits
limm authored
- 13 Mar, 2025 2 commits
- 09 Oct, 2023 1 commit
flyingdown authored
fix revert fused_dense to fp32_r. See merge request aicomponent/apex!4
- 08 Oct, 2023 1 commit
flyingdown authored
- 19 Sep, 2023 2 commits
flyingdown authored
revert multihead_attn to fp32_r. See merge request aicomponent/apex!3
root authored
- 18 Sep, 2023 3 commits
flyingdown authored
Develop. See merge request aicomponent/apex!1
flyingdown authored
flyingdown authored
- 06 Sep, 2023 2 commits
Peng authored
Revert "Changes to support hipblas migration (#113)"
Pruthvi Madugundu authored
This reverts commit 8fc9b21f.
- 18 Aug, 2023 1 commit
- 11 Aug, 2023 1 commit
Pruthvi Madugundu authored
- 20 Jun, 2023 1 commit
Pruthvi Madugundu authored
- Cherry-pick of https://github.com/NVIDIA/apex/pull/1669
- 12 Jun, 2023 1 commit
flyingdown authored
2. Add an environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16_r is used. 3. Add DCU version information; rename the wheel package; update the installation steps in the README.
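A minimal sketch of how such an environment toggle is typically consumed from Python; the accepted values ("1"/"0") and the read location are assumptions based only on the variable name in this commit, not documented behavior:

```python
import os

# Hypothetical usage: opt in to half-precision (fp16_r) rocBLAS GEMMs before
# importing this Apex fork. Accepted values ("1"/"0") are an assumption based
# on the commit message above, not a documented contract.
os.environ["APEX_ROCBLAS_GEMM_ALLOW_HALF"] = "1"

# On the consuming side, the flag would typically be read like this:
allow_half = os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0") == "1"
print(f"rocBLAS GEMM allowed to use fp16_r: {allow_half}")
```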
- 08 May, 2023 1 commit
flyingdown authored
fix test for torch 1.10.0
- 23 Apr, 2023 11 commits
Pruthvi Madugundu authored
luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS
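A hedged sketch of exercising the nesterov path mentioned above; the import path apex.optimizers.FusedLARS and the constructor arguments are assumptions, since the log does not show the optimizer's actual signature:

```python
import torch
from apex.optimizers import FusedLARS  # assumed import path for this fork's optimizer

# Small conv model in NHWC (channels_last), the layout the commit says works for resnet50.
model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
model = model.to(memory_format=torch.channels_last)

# nesterov=True exercises the "flow of using nesterov in FusedLARS"; lr/momentum
# values and keyword names here are illustrative assumptions.
optimizer = FusedLARS(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

x = torch.randn(8, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad()
```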
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578)
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit of `b005ec62b9`; use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
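For context, a minimal sketch of the distinction these commits rely on; this is standard PyTorch API usage, not code from the repository:

```python
import torch

# Legacy torch.Tensor(3) allocates a length-3 tensor with *uninitialized* values,
# which is why it is replaced by the explicit constructors below.
uninit = torch.empty(3)               # uninitialized storage of shape (3,)
init = torch.tensor([1.0, 2.0, 3.0])  # a tensor holding these initializer values

# "torch.empty() must have args": always pass a shape, e.g. () for a 0-d tensor.
scalar_like = torch.empty(())

print(uninit.shape, init, scalar_like.shape)
```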
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and updating depth_to_max_blocks to 2560 (8x the previous 320). The observed performance improvement is up to 1.4x for large numbers of elements, up to 5.2x for moderate numbers of elements, and up to 1.44x for small numbers of elements. This change only affects the optimizers, specifically when multi_tensor_apply is enabled via the --cuda_ext extension when installing apex. The performance numbers and comparison with Torch are captured in https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8 (see sheet chunk_opt).
* Updating all files related to L2norm since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits. Changes in chunk_size seem to affect reduction kernels, so this commit keeps the unoptimized settings for L2norm while optimizing all other kernels associated with the optimizers. The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh specifically for the l2norm kernels.
Co-authored-by: aspanday <aspanday@amd.com>
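A minimal sketch of where chunk_size enters on the Python side; it mirrors how Apex wires multi_tensor_apply when built with --cuda_ext, but the values and the scale op chosen here are illustrative, not taken from this commit:

```python
import torch
import amp_C                                  # fused kernels built with --cuda_ext
from apex.multi_tensor_apply import MultiTensorApply

# chunk_size = number of elements processed per chunk by each fused kernel launch.
# The commit above drops the optimizer-side value from 2048*32 (64K) to
# 256*32 (8K); the L2-norm kernels keep the 64K setting.
applier = MultiTensorApply(256 * 32)

overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")
src = [torch.randn(1024, device="cuda") for _ in range(4)]
dst = [torch.empty_like(t) for t in src]

# One fused launch scaling a whole list of tensors: dst[i] = src[i] * 0.5
applier(amp_C.multi_tensor_scale, overflow_buf, [src, dst], 0.5)
```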
luise.chen authored
* GroupBN: Reduced buffering for better hiding of calculations in some loops of length OUTER_LOOPS
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>
Pruthvi Madugundu authored
* Update register keyword handling for C++17: the 'register' storage-class keyword is removed in C++17, so it is kept active only for C++14 and lower.
* Updates to the code
hubertlu-tw authored
hubertlu-tw authored
Hubert Lu authored
* Unskip some unit tests related to issue #82
* Ensure test_state_dict uses capturable=True for torch.optim.Adam
* Fix TestFusedAdam tests in test_fused_optimizer.py
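A minimal sketch of the kind of reference comparison that flag affects; torch.optim.Adam's capturable option is standard PyTorch, while the FusedAdam pairing shown here is an illustrative assumption about the test setup:

```python
import torch
from apex.optimizers import FusedAdam

params_ref = [torch.randn(10, device="cuda", requires_grad=True)]
params_tst = [p.detach().clone().requires_grad_(True) for p in params_ref]

# capturable=True keeps Adam's step state on the GPU (as needed for CUDA-graph
# capture); the commit makes the reference torch.optim.Adam in test_state_dict use it.
ref_opt = torch.optim.Adam(params_ref, lr=1e-3, capturable=True)
tst_opt = FusedAdam(params_tst, lr=1e-3)

for p_ref, p_tst in zip(params_ref, params_tst):
    p_ref.grad = torch.ones_like(p_ref)
    p_tst.grad = torch.ones_like(p_tst)
ref_opt.step()
tst_opt.step()
torch.testing.assert_close(params_tst[0], params_ref[0], rtol=1e-4, atol=1e-4)
```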
Hubert Lu authored
* Consider both contiguous and channels_last tensors for FusedSGD
* Consider all the memory formats in fused_sgd
* Add a unit test script for nhwc fused_sgd
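A minimal sketch of the NHWC case such a unit test would cover; FusedSGD is Apex's fused optimizer, and the rest of the setup is illustrative rather than taken from the added test script:

```python
import torch
from apex.optimizers import FusedSGD

# A conv model whose params/grads are in channels_last (NHWC) memory format,
# the case the commit above makes fused_sgd handle alongside contiguous tensors.
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
model = model.to(memory_format=torch.channels_last)

opt = FusedSGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(4, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
model(x).sum().backward()
opt.step()

# The weight should still be in channels_last layout after the fused update.
assert model.weight.is_contiguous(memory_format=torch.channels_last)
```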
- 30 Mar, 2023 1 commit
Pruthvi Madugundu authored
- 23 Mar, 2023 1 commit
luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS
- 01 Mar, 2023 1 commit
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578)
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit of `b005ec62b9`; use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
- 15 Feb, 2023 1 commit
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and updating depth_to_max_blocks to 2560 (8x the previous 320). The observed performance improvement is up to 1.4x for large numbers of elements, up to 5.2x for moderate numbers of elements, and up to 1.44x for small numbers of elements. This change only affects the optimizers, specifically when multi_tensor_apply is enabled via the --cuda_ext extension when installing apex. The performance numbers and comparison with Torch are captured in https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8 (see sheet chunk_opt).
* Updating all files related to L2norm since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits. Changes in chunk_size seem to affect reduction kernels, so this commit keeps the unoptimized settings for L2norm while optimizing all other kernels associated with the optimizers. The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh specifically for the l2norm kernels.
Co-authored-by: aspanday <aspanday@amd.com>
- 13 Feb, 2023 1 commit
luise.chen authored
* GroupBN: Reduced buffering for better hiding of calculations in some loops of length OUTER_LOOPS
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
- 25 Jan, 2023 1 commit
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>