- 03 Jun, 2025 1 commit
fengzch-das authored
- 30 May, 2025 1 commit
fengzch-das authored
- 29 May, 2025 1 commit
fengzch-das authored
- 16 May, 2025 2 commits
limm authored
- 12 May, 2025 2 commits
limm authored
- 13 Mar, 2025 2 commits
- 09 Oct, 2023 1 commit
flyingdown authored
fix revert fused_dense to fp32_r. See merge request aicomponent/apex!4
- 08 Oct, 2023 1 commit
flyingdown authored
- 19 Sep, 2023 2 commits
flyingdown authored
revert multihead_attn to fp32_r. See merge request aicomponent/apex!3
root authored
- 18 Sep, 2023 3 commits
flyingdown authored
Develop. See merge request aicomponent/apex!1
flyingdown authored
flyingdown authored
- 06 Sep, 2023 2 commits
Peng authored
Revert "Changes to support hipblas migration (#113)"
Pruthvi Madugundu authored
This reverts commit 8fc9b21f.
- 18 Aug, 2023 1 commit
- 11 Aug, 2023 1 commit
Pruthvi Madugundu authored
- 20 Jun, 2023 1 commit
Pruthvi Madugundu authored
- Cherry-pick of https://github.com/NVIDIA/apex/pull/1669
- 12 Jun, 2023 1 commit
flyingdown authored
2. Add an environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16_r is used. 3. Add DCU version information; rename the wheel package; update the installation steps in the README.
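A minimal sketch of how such an environment toggle is typically consumed from Python; the accepted values ("1"/"0") and the read location are assumptions based only on the variable name in this commit, not documented behavior:

```python
import os

# Hypothetical usage: opt in to half-precision (fp16_r) rocBLAS GEMMs before
# importing this Apex fork. Accepted values ("1"/"0") are an assumption based
# on the commit message above, not a documented contract.
os.environ["APEX_ROCBLAS_GEMM_ALLOW_HALF"] = "1"

# On the consuming side, the flag would typically be read like this:
allow_half = os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0") == "1"
print(f"rocBLAS GEMM allowed to use fp16_r: {allow_half}")
```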
- 08 May, 2023 1 commit
flyingdown authored
fix test for torch 1.10.0
- 23 Apr, 2023 11 commits
Pruthvi Madugundu authored
luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS
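A hedged sketch of exercising the nesterov path mentioned above; the import path apex.optimizers.FusedLARS and the constructor arguments are assumptions, since the log does not show the optimizer's actual signature:

```python
import torch
from apex.optimizers import FusedLARS  # assumed import path for this fork's optimizer

# Small conv model in NHWC (channels_last), the layout the commit says works for resnet50.
model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
model = model.to(memory_format=torch.channels_last)

# nesterov=True exercises the "flow of using nesterov in FusedLARS"; lr/momentum
# values and keyword names here are illustrative assumptions.
optimizer = FusedLARS(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

x = torch.randn(8, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad()
```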
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578)
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit of `b005ec62b9`; use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
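For context, a minimal sketch of the distinction these commits rely on; this is standard PyTorch API usage, not code from the repository:

```python
import torch

# Legacy torch.Tensor(3) allocates a length-3 tensor with *uninitialized* values,
# which is why it is replaced by the explicit constructors below.
uninit = torch.empty(3)               # uninitialized storage of shape (3,)
init = torch.tensor([1.0, 2.0, 3.0])  # a tensor holding these initializer values

# "torch.empty() must have args": always pass a shape, e.g. () for a 0-d tensor.
scalar_like = torch.empty(())

print(uninit.shape, init, scalar_like.shape)
```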
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and updating depth_to_max_blocks to 2560 (8x the previous 320). The observed performance improvement is up to 1.4x for large numbers of elements, up to 5.2x for moderate numbers of elements, and up to 1.44x for small numbers of elements. This change only affects the optimizers, specifically when multi_tensor_apply is enabled via the --cuda_ext extension when installing apex. The performance numbers and comparison with Torch are captured in https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8 (see sheet chunk_opt).
* Updating all files related to L2norm since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits. Changes in chunk_size seem to affect reduction kernels, so this commit keeps the unoptimized settings for L2norm while optimizing all other kernels associated with the optimizers. The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh specifically for the l2norm kernels.
Co-authored-by: aspanday <aspanday@amd.com>
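A minimal sketch of where chunk_size enters on the Python side; it mirrors how Apex wires multi_tensor_apply when built with --cuda_ext, but the values and the scale op chosen here are illustrative, not taken from this commit:

```python
import torch
import amp_C                                  # fused kernels built with --cuda_ext
from apex.multi_tensor_apply import MultiTensorApply

# chunk_size = number of elements processed per chunk by each fused kernel launch.
# The commit above drops the optimizer-side value from 2048*32 (64K) to
# 256*32 (8K); the L2-norm kernels keep the 64K setting.
applier = MultiTensorApply(256 * 32)

overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")
src = [torch.randn(1024, device="cuda") for _ in range(4)]
dst = [torch.empty_like(t) for t in src]

# One fused launch scaling a whole list of tensors: dst[i] = src[i] * 0.5
applier(amp_C.multi_tensor_scale, overflow_buf, [src, dst], 0.5)
```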
luise.chen authored
* GroupBN: Reduced buffering for better hiding of calculations in some loops of length OUTER_LOOPS
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>
Pruthvi Madugundu authored
* Update register keyword handling for C++17: the 'register' storage-class keyword is removed in C++17, so it is kept active only for C++14 and lower.
* Updates to the code
hubertlu-tw authored
hubertlu-tw authored
Hubert Lu authored
* Unskip some unit tests related to issue #82
* Ensure test_state_dict uses capturable=True for torch.optim.Adam
* Fix TestFusedAdam tests in test_fused_optimizer.py
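A minimal sketch of the kind of reference comparison that flag affects; torch.optim.Adam's capturable option is standard PyTorch, while the FusedAdam pairing shown here is an illustrative assumption about the test setup:

```python
import torch
from apex.optimizers import FusedAdam

params_ref = [torch.randn(10, device="cuda", requires_grad=True)]
params_tst = [p.detach().clone().requires_grad_(True) for p in params_ref]

# capturable=True keeps Adam's step state on the GPU (as needed for CUDA-graph
# capture); the commit makes the reference torch.optim.Adam in test_state_dict use it.
ref_opt = torch.optim.Adam(params_ref, lr=1e-3, capturable=True)
tst_opt = FusedAdam(params_tst, lr=1e-3)

for p_ref, p_tst in zip(params_ref, params_tst):
    p_ref.grad = torch.ones_like(p_ref)
    p_tst.grad = torch.ones_like(p_tst)
ref_opt.step()
tst_opt.step()
torch.testing.assert_close(params_tst[0], params_ref[0], rtol=1e-4, atol=1e-4)
```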
Hubert Lu authored
* Consider both contiguous and channels_last tensors for FusedSGD
* Consider all the memory formats in fused_sgd
* Add a unit test script for nhwc fused_sgd
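A minimal sketch of the NHWC case such a unit test would cover; FusedSGD is Apex's fused optimizer, and the rest of the setup is illustrative rather than taken from the added test script:

```python
import torch
from apex.optimizers import FusedSGD

# A conv model whose params/grads are in channels_last (NHWC) memory format,
# the case the commit above makes fused_sgd handle alongside contiguous tensors.
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
model = model.to(memory_format=torch.channels_last)

opt = FusedSGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(4, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
model(x).sum().backward()
opt.step()

# The weight should still be in channels_last layout after the fused update.
assert model.weight.is_contiguous(memory_format=torch.channels_last)
```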
- 30 Mar, 2023 1 commit
Pruthvi Madugundu authored
- 23 Mar, 2023 1 commit
luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS
- 01 Mar, 2023 1 commit
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578)
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit of `b005ec62b9`; use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
- 15 Feb, 2023 1 commit
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and updating depth_to_max_blocks to 2560 (8x the previous 320). The observed performance improvement is up to 1.4x for large numbers of elements, up to 5.2x for moderate numbers of elements, and up to 1.44x for small numbers of elements. This change only affects the optimizers, specifically when multi_tensor_apply is enabled via the --cuda_ext extension when installing apex. The performance numbers and comparison with Torch are captured in https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8 (see sheet chunk_opt).
* Updating all files related to L2norm since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits. Changes in chunk_size seem to affect reduction kernels, so this commit keeps the unoptimized settings for L2norm while optimizing all other kernels associated with the optimizers. The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh specifically for the l2norm kernels.
Co-authored-by: aspanday <aspanday@amd.com>
- 13 Feb, 2023 1 commit
luise.chen authored
* GroupBN: Reduced buffering for better hiding of calculations in some loops of length OUTER_LOOPS
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
- 25 Jan, 2023 1 commit
aspanday authored
* Updating BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam. There seems to be a bug in that test that needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now. Ran 17 other tests and ALL of them pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE=1024 ONLY for the different optimizers; the L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, otherwise Allclose fails.
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>