1. 13 Feb, 2023 1 commit
    • Luise/gbn optimization (#105) · 56c283b6
      luise.chen authored
      * GroupBN: Reduced buffering to better hide computation in some loops of length OUTER_LOOPS
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for the BN and BN_relu kernels to improve resnet50 performance
      
      * GroupBN: Use C_ELEMENTS_PER_CTA=64 for the BN_add_relu kernels, giving ~10% end-to-end (E2E) improvement on resnet50
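The C_ELEMENTS_PER_CTA changes above set how many channels each CTA (thread block) covers, so the number of CTAs along the channel dimension follows by ceiling division. A minimal sketch of that tiling arithmetic (the helper name is hypothetical; only the constant mirrors the commit):

```python
def ctas_for_channels(num_channels: int, c_elements_per_cta: int = 64) -> int:
    """Ceiling division: how many CTAs are needed to cover all channels."""
    return (num_channels + c_elements_per_cta - 1) // c_elements_per_cta

# e.g. a 256-channel resnet50 layer needs 4 CTAs at 64 channels each
```

Smaller tiles mean more CTAs per layer, which can help fill the GPU on layers with few channels.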
  2. 25 Jan, 2023 1 commit
    • Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27
      aspanday authored
      * Updating BLOCK_SIZE to 1024.
      tests/L0/run_optimizers/test_fused_optimizer.py passes except for bfloat16 for Adam; there appears to be a bug in that test that still needs to be resolved.
      For now, test_bfloat16 for Adam is skipped in the unittest.
      Ran 17 other tests and ALL of them pass!
      More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization
      
      This commit changes BLOCK_SIZE to 1024 ONLY for the different optimizers.
      The L2norm kernels (part of the LAMB optimizer algorithm) still keep BLOCK_SIZE=512; otherwise allclose fails.
      
      * Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 in Adam.
      Co-authored-by: aspanday <aspanday@amd.com>
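The @skipIfRocm decorator mentioned above is essentially a thin wrapper around unittest.skipIf keyed on a ROCm check. A minimal sketch (the IS_ROCM flag here is a stand-in for illustration; apex derives it from the torch build, and the test body is a placeholder):

```python
import unittest

# Stand-in for ROCm detection (normally derived from torch.version.hip)
IS_ROCM = False

def skipIfRocm(fn):
    """Skip the decorated test when running on a ROCm build."""
    return unittest.skipIf(IS_ROCM, "test doesn't currently work on ROCm")(fn)

class TestFusedAdam(unittest.TestCase):
    @skipIfRocm
    def test_bfloat16(self):
        self.assertTrue(True)  # placeholder body for illustration

# Run the suite programmatically; with IS_ROCM=True the test would be skipped
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFusedAdam)
result = unittest.TestResult()
suite.run(result)
```

With IS_ROCM set to True, the test lands in result.skipped instead of failing.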
  3. 20 Dec, 2022 1 commit
  4. 10 Dec, 2022 1 commit
  5. 09 Dec, 2022 2 commits
  6. 06 Dec, 2022 2 commits
  7. 21 Sep, 2022 1 commit
  8. 19 Sep, 2022 1 commit
    • Faster build (#95) · 89f5722c
      Hubert Lu authored
      * Remove redundant imports and enable ninja for the MHA extension
      
      * Remove redundant CUDAExtension imports
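Enabling ninja for a torch C++/CUDA extension goes through BuildExtension in setup.py, which parallelizes compilation for a faster build. A sketch of what this might look like (the module and source names here are illustrative, not apex's actual ones):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="fused_mha_example",  # illustrative name, not apex's
    ext_modules=[
        CUDAExtension(
            name="fused_mha_example",
            sources=["mha_kernel.cu"],  # illustrative source file
        )
    ],
    # use_ninja=True builds object files in parallel via ninja
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=True)},
)
```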
  9. 08 Sep, 2022 4 commits
  10. 07 Sep, 2022 2 commits
  11. 26 Aug, 2022 1 commit
  12. 23 Aug, 2022 2 commits
  13. 22 Aug, 2022 2 commits
  14. 15 Aug, 2022 1 commit
  15. 10 Aug, 2022 1 commit
  16. 09 Aug, 2022 7 commits
  17. 08 Aug, 2022 6 commits
  18. 05 Aug, 2022 1 commit
    • Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      
      
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
      Co-authored-by: eqy <eddiey@nvidia.com>
      Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
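The epsilon-placement doc fixes above concern where eps sits in the RMSNorm formula: it is added to the mean of squares inside the square root, y = x / sqrt(mean(x^2) + eps) * g. A plain-Python reference sketch of that formula (not apex's fused kernel):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm: eps is added to mean(x^2) inside the sqrt."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Unlike LayerNorm, there is no mean subtraction and no bias; only the root-mean-square scale and the learnable gain.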
  19. 29 Jul, 2022 3 commits