Commits · 03204b8451ba962a0844621c22fd3748d65dfc11 · OpenDAS / apex

18 Aug, 2023 1 commit
- Merge branch 'master' of https://github.com/ROCmSoftwarePlatform/apex · 03204b84
  flyingdown authored Aug 18, 2023
  
11 Aug, 2023 1 commit
- Changes to support hipblas migration (#113) · 8fc9b21f
  Pruthvi Madugundu authored Aug 11, 2023
  
20 Jun, 2023 1 commit
- Adding pyproject.toml file (#112) · 10c74820
  Pruthvi Madugundu authored Jun 20, 2023
```
- Cherry-pick of https://github.com/NVIDIA/apex/pull/1669
```
12 Jun, 2023 1 commit

flyingdown authored Jun 06, 2023

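2. Add the environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16r is used
3. Add DCU version information

Change the whl package name

Update the installation steps in the README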
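
As an illustration of how a switch such as APEX_ROCBLAS_GEMM_ALLOW_HALF might be read, here is a minimal Python sketch. Only the variable name comes from the commit message. The helper name `allow_half_gemm` and the set of accepted values are assumptions, and the real extension may parse the variable differently (most likely in C++).

```python
import os

ENV_NAME = "APEX_ROCBLAS_GEMM_ALLOW_HALF"  # name taken from the commit message


def allow_half_gemm(default=False):
    """Return True when the variable is set to a truthy value.

    Assumption: "1", "true", "yes" and "on" (case-insensitive) mean enabled.
    The actual extension may interpret the value differently.
    """
    raw = os.environ.get(ENV_NAME)
    if raw is None:
        # Variable not set: fall back to the caller's default.
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

Under these assumptions, exporting `APEX_ROCBLAS_GEMM_ALLOW_HALF=1` before starting the process would make `allow_half_gemm()` return `True`.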

f8b650c8

08 May, 2023 1 commit
- add README_HIP · 2c6c0f28
  flyingdown authored May 08, 2023
```
fix test for torch 1.10.0
```
23 Apr, 2023 11 commits

Update rccl header include path (#110) · 2d8b3600
Pruthvi Madugundu authored Mar 29, 2023


Add FusedLARS optimizer (#109) · e519c1e3

luise.chen authored Mar 24, 2023

* Add fused_lars optimizer

* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW

* Add flow of using nesterov in FusedLARS


Cherry-picks some commits to replace torch.Tensor and remove dependency on six (#107) · 3d72ea06

Hubert Lu authored Mar 01, 2023



* replace torch.Tensor with torch.empty (#1578)

* replace torch.Tensor with torch.empty

* nit

* nit

* torch.empty() must have args (#1584)

* use `torch.tensor` to create a tensor with initializer values (#1588)

* use `torch.tensor` with init values
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* Update apex/contrib/sparsity/sparse_masklib.py

* remove torch._six
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* retire `torch._six`

as per the upstream commit of `b005ec62b9`.
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

---------
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

---------
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>


Grid optimization - Chunk_Size optimization. (#104) · 1578c0c7

aspanday authored Feb 15, 2023

* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 test for Adam. There seems to be a bug in this test that needs to be resolved.
For now, test_bfloat16 for Adam is skipped in the unit test.
The 17 other tests were run and all of them pass.
More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization
This commit changes BLOCK_SIZE to 1024 only for the optimizers.
L2norm kernels (part of the LAMB optimizer algorithm) still keep BLOCK_SIZE=512; otherwise allclose fails.

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.

* Updating chunk_size to 256*32 (8K), which was previously 2048*32 (64K).
In addition, updating depth_to_max_blocks to 2560 (8x the previous 320).
The observed performance improvement is up to 1.4x for a large number of elements, up to 5.2x for a moderate number of elements, and up to 1.44x for a small number of elements.
This change only affects the optimizers, specifically when multi_tensor_apply is enabled by installing apex with the --cuda_ext extension.
The performance results, along with a comparison with Torch, are captured here:
https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8
See sheet chunk_opt.

* Updating all files related to L2norm since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with previous commits.
Changes in chunk_size seem to affect the reduction kernels, so this commit makes it possible to keep the unoptimized settings for L2norm while applying the optimizations to all other kernels associated with the optimizers.
The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh, which is meant to be used specifically by the l2norm kernels.

---------
Co-authored-by: aspanday <aspanday@amd.com>
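
The chunk-size numbers in this commit message can be checked with a small Python sketch. It is a simplified model, not the actual multi_tensor_apply code: it assumes each tensor is split into chunks independently and that one block handles one chunk, and it ignores any per-launch limit on the number of tensors. The function names and the tensor sizes are hypothetical. Only the four constants come from the commit message.

```python
import math

OLD_CHUNK_SIZE = 2048 * 32   # 65536 elements (64K), per the commit message
NEW_CHUNK_SIZE = 256 * 32    # 8192 elements (8K), per the commit message
OLD_MAX_BLOCKS = 320         # previous depth_to_max_blocks value
NEW_MAX_BLOCKS = 2560        # new depth_to_max_blocks value (8x)


def chunks_needed(tensor_sizes, chunk_size):
    """Total number of chunks when each tensor is split independently."""
    return sum(math.ceil(n / chunk_size) for n in tensor_sizes)


def launches_needed(tensor_sizes, chunk_size, max_blocks):
    """Simplified launch count: one block per chunk, capped per launch."""
    return math.ceil(chunks_needed(tensor_sizes, chunk_size) / max_blocks)


sizes = [1_000_000, 50_000, 300]  # hypothetical parameter tensor sizes
old_chunks = chunks_needed(sizes, OLD_CHUNK_SIZE)
new_chunks = chunks_needed(sizes, NEW_CHUNK_SIZE)
old_launches = launches_needed(sizes, OLD_CHUNK_SIZE, OLD_MAX_BLOCKS)
new_launches = launches_needed(sizes, NEW_CHUNK_SIZE, NEW_MAX_BLOCKS)
```

In this model the smaller chunk size yields more, finer-grained blocks for the same tensors, while the maximum number of elements covered by a single launch (chunk size times maximum blocks) is the same before and after the change, because the chunk size shrinks by 8x and depth_to_max_blocks grows by 8x.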


Luise/gbn optimization (#105) · cdc17060

luise.chen authored Feb 14, 2023

* GroupBN: Reduced buffering for better hiding calculations in some loops of length OUTER_LOOPS

* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50

* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50


Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 06053e19

aspanday authored Jan 24, 2023

* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 test for Adam. There seems to be a bug in this test that needs to be resolved.
For now, test_bfloat16 for Adam is skipped in the unit test.
The 17 other tests were run and all of them pass.
More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization
This commit changes BLOCK_SIZE to 1024 only for the optimizers.
L2norm kernels (part of the LAMB optimizer algorithm) still keep BLOCK_SIZE=512; otherwise allclose fails.

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>


Update register keyword handling for C++17 (#100) · f34cade5

Pruthvi Madugundu authored Dec 20, 2022

* Update register keyword handling for C++17

The 'register' storage class specifier was removed in C++17,
so it is kept active only for C++14 and lower.

* Updates to the code


Add fused_dense in the extension unit test script · 722e1c3f
hubertlu-tw authored Dec 09, 2022

Fix a bug in fused_dense_cuda on ROCm · 5c373f70
hubertlu-tw authored Dec 09, 2022


Unskip some unit tests related to issue #82 (#98) · 2951440a

Hubert Lu authored Dec 06, 2022

* Unskip some unit tests related to issue #82

* Ensure test_state_dict uses capturable=True for torch.optim.Adam

* Fix TestFusedAdam tests in test_fused_optimizer.py


Consider both contiguous and channels_last tensors for FusedSGD (#97) · 9a13347c

Hubert Lu authored Dec 06, 2022

* Consider both contiguous and channels_last tensors for FusedSGD

* Consider all the memory formats in fused_sgd

* Add a unit test script for nhwc fused_sgd


30 Mar, 2023 1 commit
- Update rccl header include path (#110) · 18921471
  Pruthvi Madugundu authored Mar 29, 2023
  
23 Mar, 2023 1 commit

Add FusedLARS optimizer (#109) · 7a428776

luise.chen authored Mar 24, 2023

* Add fused_lars optimizer

* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW

* Add flow of using nesterov in FusedLARS


01 Mar, 2023 1 commit

Cherry-picks some commits to replace torch.Tensor and remove dependency on six (#107) · 03d70c41

Hubert Lu authored Mar 01, 2023



* replace torch.Tensor with torch.empty (#1578)

* replace torch.Tensor with torch.empty

* nit

* nit

* torch.empty() must have args (#1584)

* use `torch.tensor` to create a tensor with initializer values (#1588)

* use `torch.tensor` with init values
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* Update apex/contrib/sparsity/sparse_masklib.py

* remove torch._six
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* retire `torch._six`

as per the upstream commit of `b005ec62b9`.
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

---------
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

---------
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>


15 Feb, 2023 1 commit

Grid optimization - Chunk_Size optimization. (#104) · b047a1f1

aspanday authored Feb 15, 2023

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.

---------
Co-authored-by: aspanday <aspanday@amd.com>


13 Feb, 2023 1 commit

Luise/gbn optimization (#105) · 56c283b6

luise.chen authored Feb 14, 2023

* GroupBN: Reduced buffering for better hiding calculations in some loops of length OUTER_LOOPS

* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50

* GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50


25 Jan, 2023 1 commit

Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27

aspanday authored Jan 24, 2023

* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 test for Adam. There seems to be a bug in this test that needs to be resolved.
For now, test_bfloat16 for Adam is skipped in the unit test.
The 17 other tests were run and all of them pass.
More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization
This commit changes BLOCK_SIZE to 1024 only for the optimizers.
L2norm kernels (part of the LAMB optimizer algorithm) still keep BLOCK_SIZE=512; otherwise allclose fails.

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>


20 Dec, 2022 1 commit

Update register keyword handling for C++17 (#100) · f05aaca0

Pruthvi Madugundu authored Dec 20, 2022

* Update register keyword handling for C++17

The 'register' storage class specifier was removed in C++17,
so it is kept active only for C++14 and lower.

* Updates to the code


10 Dec, 2022 1 commit
- Merge pull request #99 from ROCmSoftwarePlatform/dev/hubertlu/fused_dense_debug · 6e453f1a
  kkHuang-amd authored Dec 10, 2022
```
Fix a bug in fused_dense_cuda on ROCm
```
09 Dec, 2022 2 commits
- Add fused_dense in the extension unit test script · d63b5d1f
  hubertlu-tw authored Dec 09, 2022
  
- Fix a bug in fused_dense_cuda on ROCm · e90ba51b
  hubertlu-tw authored Dec 09, 2022
  
06 Dec, 2022 2 commits

Unskip some unit tests related to issue #82 (#98) · 4dcf30a6

Hubert Lu authored Dec 06, 2022

* Unskip some unit tests related to issue #82

* Ensure test_state_dict uses capturable=True for torch.optim.Adam

* Fix TestFusedAdam tests in test_fused_optimizer.py


Consider both contiguous and channels_last tensors for FusedSGD (#97) · 9ebc53e5

Hubert Lu authored Dec 06, 2022

* Consider both contiguous and channels_last tensors for FusedSGD

* Consider all the memory formats in fused_sgd

* Add a unit test script for nhwc fused_sgd


14 Nov, 2022 1 commit
- modify rocblas_gemm_ex's compute_type to rocblas_datatype_f16_r for fp16 · db7007ae
  flyingdown authored Nov 14, 2022
  
11 Nov, 2022 1 commit
- add --gpu-max-threads-per-block=1024 options · 32ab028c
  flyingdown authored Nov 11, 2022
  
08 Nov, 2022 2 commits
- Modify setup.py, fix build errors, and adapt to dtk-22.10 · b10621d1
  flyingdown authored Nov 08, 2022
  
- replace distributed_fused_lamb.py · 86dfa18d
  flyingdown authored Aug 04, 2022
  
21 Sep, 2022 1 commit

Make index_mul_2d extension backward compatible for Atomic header include (#96) · 719215bd

Hubert Lu authored Sep 21, 2022



* Make index_mul_2d extension backward compatible for Atomic header include

* Typo
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>


19 Sep, 2022 1 commit

Faster build (#95) · 89f5722c

Hubert Lu authored Sep 19, 2022

* Remove redundant imports and enable ninja for MHA extension

* Remove redundant CUDAExtension imports


08 Sep, 2022 4 commits
- Merge pull request #91 from ROCmSoftwarePlatform/dev/hubertlu/focal_loss_and_index_mul_2d_cuda · 5acb8d00
  Jithun Nair authored Sep 08, 2022
```
Enable --focal_loss and --index_mul_2d extensions for ROCm
```
- Merge branch 'master' into dev/hubertlu/focal_loss_and_index_mul_2d_cuda · 7a344314
  Jithun Nair authored Sep 08, 2022
  
- Enable --transducer extension for ROCm (#88) · ae5ca671
  Hubert Lu authored Sep 08, 2022
```
* Enable --transducer extension for ROCm

* Enable --transducer unit tests for ROCm

* Skip some failing tests in test_transducer_joint.py

* Skip test_transducer_joint_pack for transducer extension

* Keep transducer extension CUDA-compatible
```
- Merge pull request #87 from ROCmSoftwarePlatform/dev/hubertlu/apex_peer_memory_nccl_p2p · a53b4417
  Jithun Nair authored Sep 08, 2022
```
Enable --peer_memory and --nccl_p2p extensions for ROCm
```
07 Sep, 2022 2 commits
- Keep --peer_memory and --nccl_p2p CUDA-compatible · bc64ee83
  hubertlu-tw authored Sep 07, 2022
  
- Merge remote-tracking branch 'origin/master' into dev/hubertlu/focal_loss_and_index_mul_2d_cuda · 9187ea1d
  hubertlu-tw authored Sep 07, 2022
  