1. 01 Mar, 2023 1 commit
  2. 15 Feb, 2023 1 commit
    • Grid optimization - Chunk_Size optimization. (#104) · b047a1f1
      aspanday authored
      * Updating BLOCK_SIZE to 1024.
      tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 with Adam; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now.
      Ran the 17 other tests; all of them pass.
      More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
      This commit sets BLOCK_SIZE=1024 only for the optimizer kernels.
      L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise the allclose check fails.
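      A minimal CUDA sketch of the split described above (illustration only; the kernel and constant names here are hypothetical, not from the apex source):
      ```cuda
      #include <cuda_runtime.h>

      // Illustration only: optimizer update kernels launch with 1024
      // threads per block after this change, while the L2-norm
      // reduction kernels keep 512 threads per block.
      constexpr int OPT_BLOCK_SIZE    = 1024;  // fused Adam/SGD/LAMB updates
      constexpr int L2NORM_BLOCK_SIZE = 512;   // LAMB's L2-norm reductions

      __global__ void fused_update_chunk(float* p, const float* g, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] -= 1e-3f * g[i];  // stand-in for the real update math
      }

      void launch_update(float* p, const float* g, int n) {
        int blocks = (n + OPT_BLOCK_SIZE - 1) / OPT_BLOCK_SIZE;
        fused_update_chunk<<<blocks, OPT_BLOCK_SIZE>>>(p, g, n);
      }
      ```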
      
      * Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 in Adam.
      
      * Updating chunk_size to 256*32 (8K); it was previously 2048*32 (64K).
      In addition, updating depth_to_max_blocks to 2560 (8x the previous 320).
      The observed performance improvement is up to 1.4x for large element counts, up to 5.2x for moderate element counts, and up to 1.44x for small element counts.
      This change only affects the optimizers, specifically when multi_tensor_apply is enabled by installing apex with the --cuda_ext extension.
      The full set of performance numbers, along with a comparison against Torch, is captured here:
      https://amdcloud.sharepoint.com//r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8
      See sheet chunk_opt.
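      Roughly how these two values feed into multi_tensor_apply's grid sizing (a sketch assuming a simple one-block-per-chunk mapping; grid_size is a hypothetical helper, not apex code):
      ```cuda
      #include <cstdio>

      // Each tensor is cut into chunk_size-element chunks, one block
      // per chunk, capped by depth_to_max_blocks. Values below are the
      // ones quoted in the commit.
      long long grid_size(long long numel, int chunk_size, int max_blocks) {
        long long chunks = (numel + chunk_size - 1) / chunk_size;
        return chunks < max_blocks ? chunks : max_blocks;
      }

      int main() {
        const long long numel = 1LL << 24;  // example: 16M-element tensor
        std::printf("old: %lld blocks\n", grid_size(numel, 2048 * 32, 320));   // 256
        std::printf("new: %lld blocks\n", grid_size(numel, 256 * 32, 2560));   // 2048
        return 0;
      }
      ```
      With the old settings the grid tops out well below what a wide GPU can keep busy; the smaller chunks launch many more blocks, which is where the speedups quoted above come from.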
      
      * Updating all files related to L2norm, since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits.
      Changes in chunk_size appear to affect the reduction kernels, so this commit keeps the unoptimized configuration for L2norm while applying the optimizations to all other kernels used by the optimizers.
      The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, along with multi_tensor_apply_base.cuh, to be used specifically by the l2norm kernels.
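      Why chunk_size affects a reduction at all: floating-point addition is not associative, so regrouping the per-chunk partial sums shifts the rounding enough to trip a tight allclose. A minimal host-side sketch (chunked_sumsq is hypothetical, for illustration only):
      ```cuda
      #include <cstdio>

      // Sum of squares computed per chunk, then combined, mimicking
      // how a chunked reduction groups its partial sums.
      float chunked_sumsq(const float* x, int n, int chunk) {
        float total = 0.f;
        for (int c = 0; c < n; c += chunk) {
          float partial = 0.f;  // per-chunk partial sum
          for (int i = c; i < c + chunk && i < n; ++i) partial += x[i] * x[i];
          total += partial;     // combine partials
        }
        return total;
      }

      int main() {
        static float x[1 << 16];
        for (int i = 0; i < (1 << 16); ++i) x[i] = 1e-3f * (i % 97);
        // The two groupings give slightly different float results.
        std::printf("8K chunks : %.9f\n", chunked_sumsq(x, 1 << 16, 8192));
        std::printf("64K chunks: %.9f\n", chunked_sumsq(x, 1 << 16, 65536));
        return 0;
      }
      ```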
      
      ---------
      Co-authored-by: aspanday <aspanday@amd.com>
  3. 13 Feb, 2023 1 commit
    • Luise/gbn optimization (#105) · 56c283b6
      luise.chen authored
      * GroupBN: Reduced buffering to better hide calculations in some loops of length OUTER_LOOPS
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels for improvement of resnet50
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels for ~10% E2E improvement of resnet50
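      A rough sketch of what C_ELEMENTS_PER_CTA controls, assuming it is the slice of channels each CTA reduces over in the NHWC kernels (illustration only, not the groupbn source):
      ```cuda
      #include <cstdio>

      // Illustration only: if each CTA owns C_ELEMENTS_PER_CTA channels
      // of an NHWC tensor and reduces over N*H*W for them, a smaller
      // value spreads a layer's channels over more CTAs.
      constexpr int C_ELEMENTS_PER_CTA = 64;  // value chosen in this commit

      int main() {
        const int C = 256;  // e.g. one resnet50 stage's channel count
        int ctas = (C + C_ELEMENTS_PER_CTA - 1) / C_ELEMENTS_PER_CTA;
        std::printf("CTAs along the channel dim: %d\n", ctas);
        return 0;
      }
      ```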
  4. 25 Jan, 2023 1 commit
    • Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27
      aspanday authored
      * Updating BLOCK_SIZE to 1024.
      tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 with Adam; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped in the unittest for now.
      Ran the 17 other tests; all of them pass.
      More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
      This commit sets BLOCK_SIZE=1024 only for the optimizer kernels.
      L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise the allclose check fails.
      
      * Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 in Adam.
      Co-authored-by: aspanday <aspanday@amd.com>
  5. 20 Dec, 2022 1 commit
  6. 10 Dec, 2022 1 commit
  7. 09 Dec, 2022 2 commits
  8. 06 Dec, 2022 2 commits
  9. 21 Sep, 2022 1 commit
  10. 19 Sep, 2022 1 commit
    • Faster build (#95) · 89f5722c
      Hubert Lu authored
      * Remove redundant imports and enable ninja for the MHA extension
      
      * Remove redundant CUDAExtension imports
  11. 08 Sep, 2022 4 commits
  12. 07 Sep, 2022 2 commits
  13. 26 Aug, 2022 1 commit
  14. 23 Aug, 2022 2 commits
  15. 22 Aug, 2022 2 commits
  16. 15 Aug, 2022 1 commit
  17. 10 Aug, 2022 1 commit
  18. 09 Aug, 2022 7 commits
  19. 08 Aug, 2022 6 commits
  20. 05 Aug, 2022 1 commit
    • Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
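      For reference, the formula those doc commits describe, with epsilon added inside the root (the standard RMSNorm form; a reconstruction, not a quote from the generated docs):
      ```latex
      % RMSNorm with learnable gain g and epsilon inside the square root:
      \[
        \mathrm{RMSNorm}(x)_i \;=\;
          \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^{2} \;+\; \epsilon}}
          \; g_i
      \]
      ```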
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
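      A minimal sketch of that last change (both calls are real PyTorch C++ APIs; everything around them is elided):
      ```cuda
      #include <ATen/cuda/CUDAContext.h>

      int pick_warp_size() {
        // before: at::cuda::getCurrentDeviceProperties()->warpSize;
        // after: the dedicated helper, which reports 64 on ROCm
        // (wavefront width) and 32 on CUDA.
        return at::cuda::warp_size();
      }
      ```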
      Co-authored-by: eqy <eddiey@nvidia.com>
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
  21. 29 Jul, 2022 1 commit