- 14 Nov, 2022 1 commit
  - flyingdown authored
- 11 Nov, 2022 1 commit
  - flyingdown authored
- 08 Nov, 2022 2 commits
  - flyingdown authored
  - flyingdown authored
- 21 Sep, 2022 1 commit
  - Hubert Lu authored
    * Make index_mul_2d extension backward compatible for Atomic header include
    * Typo
    Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
- 19 Sep, 2022 1 commit
  - Hubert Lu authored
    * Remove redundant imports and enable ninja for the MHA extension
    * Remove redundant CUDAExtension imports
- 08 Sep, 2022 4 commits
  - Jithun Nair authored
    Enable --focal_loss and --index_mul_2d extensions for ROCm (the flag-gated build pattern is sketched after this block)
  - Jithun Nair authored
  - Hubert Lu authored
    * Enable --transducer extension for ROCm
    * Enable --transducer unit tests for ROCm
    * Skip some failing tests in test_transducer_joint.py
    * Skip test_transducer_joint_pack for transducer extension
    * Keep transducer extension CUDA-compatible
  - Jithun Nair authored
    Enable --peer_memory and --nccl_p2p extensions for ROCm
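For context, these contrib extensions are opt-in: apex's setup.py only compiles an extension when its flag appears on the install command line, and the same CUDAExtension sources are hipified when PyTorch is a ROCm build. The sketch below shows that flag-gating pattern in minimal form; the extension name, source paths, and the use of ninja are illustrative assumptions rather than apex's exact setup.py.

```python
# Minimal sketch of an opt-in CUDA/ROCm extension gated by a setup.py flag.
# Names and paths are illustrative, not copied from apex's setup.py.
import sys
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

ext_modules = []

if "--focal_loss" in sys.argv:
    sys.argv.remove("--focal_loss")
    # CUDAExtension also covers ROCm: on a ROCm build of PyTorch the .cu
    # sources are hipified automatically by the extension build machinery.
    ext_modules.append(
        CUDAExtension(
            name="focal_loss_cuda",
            sources=[
                "apex/contrib/csrc/focal_loss/focal_loss_cuda.cpp",
                "apex/contrib/csrc/focal_loss/focal_loss_cuda_kernel.cu",
            ],
        )
    )

setup(
    name="apex",
    ext_modules=ext_modules,
    # use_ninja speeds up compilation when ninja is installed.
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=True)},
)
```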
- 07 Sep, 2022 2 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
- 26 Aug, 2022 1 commit
  - Hubert Lu authored
    * Handle len(cached_x.grad_fn.next_functions) == 1 in cached_cast (see the sketch after this block)
    * Unskip the unit tests related to len(cached_x.grad_fn.next_functions) == 1
    Co-authored-by: David Fan <jiafa@microsoft.com>
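The check this commit touches is the autograd-parent guard in apex's amp cached_cast, which verifies that a cached lower-precision copy really was produced from the tensor being cast before reusing it. The sketch below is a simplified, assumed form of that guard; the exact fallback when next_functions has a single entry is an assumption, and the surrounding caching logic is omitted.

```python
def _autograd_parent(cached_x):
    """Return the variable that produced cached_x through the recorded cast.

    Depending on how the cast was recorded, grad_fn.next_functions may hold
    one entry or two, so index defensively instead of assuming [1][0].
    """
    next_functions = cached_x.grad_fn.next_functions
    # Assumed handling: fall back to the first entry when only one exists.
    idx = 1 if len(next_functions) > 1 else 0
    return next_functions[idx][0].variable


def check_cached_cast(x, cached_x):
    # When both tensors are tracked by autograd, make sure x really is the
    # parent of the cached cast; otherwise the cache entry is stale.
    if x.requires_grad and cached_x.requires_grad:
        if _autograd_parent(cached_x) is not x:
            raise RuntimeError(
                "x and cache[x] both require grad, but x is not cache[x]'s parent."
            )
```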
- 23 Aug, 2022 2 commits
  - hubertlu-tw authored
  - hanbao authored
    Co-authored-by: Han Bao <hbao@nvidia.com>
- 22 Aug, 2022 2 commits
  - Thor Johnsen authored
  - hubertlu-tw authored
- 15 Aug, 2022 1 commit
  - Jithun Nair authored
    IFU-master-2022-07-29
- 10 Aug, 2022 1 commit
  - hubertlu-tw authored
- 09 Aug, 2022 7 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
- 08 Aug, 2022 6 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
  - Hubert Lu authored
    * Skip the failing unit tests from the FusedRMSNorm PR (the skip pattern is sketched after this block)
    * Update test_lamb.py
    Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
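Several of the ROCm commits in this log skip known-failing unit tests rather than delete them. The sketch below shows the usual PyTorch-style skip condition: torch.version.hip is set on ROCm builds and None on CUDA builds. The test class, method name, and reason string are illustrative, not taken from the repo.

```python
import unittest

import torch

# torch.version.hip is non-None only on ROCm builds of PyTorch, so this flag
# marks tests that should be skipped on ROCm but still run on CUDA.
IS_ROCM = torch.version.hip is not None


class TestFusedRMSNorm(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "Known failure on ROCm; tracked for a follow-up fix")
    def test_autocast_fused_rmsnorm(self):
        ...  # body of the temporarily skipped test
```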
- 05 Aug, 2022 1 commit
  - Hubert Lu authored
    * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) (a reference formulation is sketched after this block)
    * FusedRMSNorm based on FusedLayerNorm
    * refactor duplicated kernels
    * delete comments
    * delete comments
    * cleanup
    * cleanup
    * cleanup, fixed clobbering forward_affine_mixed_dtypes
    * fix pybind naming and add MixedFused test
    * undo skipping
    * check elementwise_affine
    * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py (Oof, nice catch, thanks)
    * fix and generate docs for FusedRMSNorm (#1285)
    * [FusedRMSNorm doc] document where epsilon is added (#1295)
    * [FusedRMSNorm doc] add epsilon to formula
    * correct
    * better wording
    * Fix some bugs
    * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
    * Fix NaN issues in FusedRMSNorm
    * Update test_fused_layer_norm.py
    * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
    * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
    Co-authored-by: eqy <eddiey@nvidia.com>
    Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
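For reference, RMSNorm ("T5LayerNorm") drops LayerNorm's mean-centering and bias and normalizes by the root-mean-square of the input, with epsilon added inside the square root, which is the point the doc bullets above document. The plain PyTorch sketch below shows the math the fused kernel implements; it is a reference formulation, not the fused CUDA/HIP code itself.

```python
import torch


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Reference RMSNorm over the last dimension.

    Unlike LayerNorm there is no mean subtraction and no bias; epsilon is
    added to the mean of squares before taking the square root.
    """
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
    x_normed = x * torch.rsqrt(mean_sq + eps)
    return weight * x_normed


# Usage: normalize a (batch, seq, hidden) activation with a learnable scale.
hidden = 64
weight = torch.ones(hidden)
y = rms_norm(torch.randn(2, 8, hidden), weight)
```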
- 29 Jul, 2022 5 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
- 28 Jul, 2022 1 commit
  - Eric Harper authored
    * use _all_gather_base (see the sketch after this block)
    * use _reduce_scatter_base
    * remove torch empty in backward
    * check self.attn_mask_type
    * remove extra arg
    * update get_tensor_shapes logic
    Signed-off-by: ericharper <complex451@gmail.com>
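The first two bullets swap list-based collectives for the flat "base" collectives in torch.distributed, which gather into or reduce-scatter from one contiguous tensor instead of a Python list of per-rank chunks. A small sketch of the call pattern follows; the tensor sizes and process-group setup are illustrative, and newer PyTorch releases expose the same operations as all_gather_into_tensor / reduce_scatter_tensor.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every rank.
world_size = dist.get_world_size()

local = torch.randn(1024, device="cuda")

# _all_gather_base writes every rank's shard into one contiguous buffer,
# avoiding the list-of-tensors copies that dist.all_gather performs.
gathered = torch.empty(world_size * local.numel(), device="cuda")
dist._all_gather_base(gathered, local)

# _reduce_scatter_base is the inverse: reduce a full-size buffer across ranks
# and keep only this rank's contiguous shard.
full = torch.randn(world_size * 1024, device="cuda")
shard = torch.empty(1024, device="cuda")
dist._reduce_scatter_base(shard, full)
```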
- 26 Jul, 2022 1 commit
  - Tim Moon authored
    * Improvements in distributed Adam optimizer for Megatron
      Add option to allocate gradient buckets out of one large buffer (see the sketch after this block).
      Add option to initialize params in user-provided order.
      Perform communication when saving optimizer state.
      Support param sync with any dtype.
    * Style fixes in distributed Adam helper classes
      Review suggestions from @crcrpar
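"Gradient buckets out of one large buffer" refers to carving per-bucket gradient storage out of a single preallocated flat tensor, so buckets are contiguous views of the same storage and can be communicated without extra gather/copy steps. The sketch below illustrates only that allocation pattern; the sizes and bucket layout are assumptions, not the distributed Adam implementation itself.

```python
import torch

bucket_sizes = [1 << 20, 1 << 20, 1 << 19]  # elements per gradient bucket (illustrative)

# One large backing buffer for all buckets...
flat_buffer = torch.zeros(sum(bucket_sizes), dtype=torch.float16, device="cuda")

# ...and per-bucket views carved out of it with narrow(), so each bucket is a
# contiguous slice of the same storage.
buckets, offset = [], 0
for size in bucket_sizes:
    buckets.append(flat_buffer.narrow(0, offset, size))
    offset += size

# Gradients accumulated into buckets[i] land directly in flat_buffer, so a
# collective over flat_buffer (or a slice of it) covers them with no repacking.
```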