- 14 Nov, 2022 1 commit
  - flyingdown authored
- 11 Nov, 2022 1 commit
  - flyingdown authored
- 08 Nov, 2022 2 commits
  - flyingdown authored
  - flyingdown authored
- 21 Sep, 2022 1 commit
  - Hubert Lu authored
    * Make index_mul_2d extension backward compatible for Atomic header include
    * Typo
    Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
- 19 Sep, 2022 1 commit
  - Hubert Lu authored
    * Remove redundant imports and enable ninja for the MHA extension
    * Remove redundant CUDAExtension imports
- 08 Sep, 2022 4 commits
  - Jithun Nair authored
    Enable --focal_loss and --index_mul_2d extensions for ROCm (the flag-gated build pattern is sketched after this block)
  - Jithun Nair authored
  - Hubert Lu authored
    * Enable --transducer extension for ROCm
    * Enable --transducer unit tests for ROCm
    * Skip some failing tests in test_transducer_joint.py
    * Skip test_transducer_joint_pack for transducer extension
    * Keep transducer extension CUDA-compatible
  - Jithun Nair authored
    Enable --peer_memory and --nccl_p2p extensions for ROCm
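For context, these contrib extensions are opt-in: apex's setup.py only compiles an extension when its flag appears on the install command line, and the same CUDAExtension sources are hipified when PyTorch is a ROCm build. The sketch below shows that flag-gating pattern in minimal form; the extension name, source paths, and the use of ninja are illustrative assumptions rather than apex's exact setup.py.

```python
# Minimal sketch of an opt-in CUDA/ROCm extension gated by a setup.py flag.
# Names and paths are illustrative, not copied from apex's setup.py.
import sys
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

ext_modules = []

if "--focal_loss" in sys.argv:
    sys.argv.remove("--focal_loss")
    # CUDAExtension also covers ROCm: on a ROCm build of PyTorch the .cu
    # sources are hipified automatically by the extension build machinery.
    ext_modules.append(
        CUDAExtension(
            name="focal_loss_cuda",
            sources=[
                "apex/contrib/csrc/focal_loss/focal_loss_cuda.cpp",
                "apex/contrib/csrc/focal_loss/focal_loss_cuda_kernel.cu",
            ],
        )
    )

setup(
    name="apex",
    ext_modules=ext_modules,
    # use_ninja speeds up compilation when ninja is installed.
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=True)},
)
```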
- 07 Sep, 2022 2 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
- 26 Aug, 2022 1 commit
  - Hubert Lu authored
    * Handle len(cached_x.grad_fn.next_functions) == 1 in cached_cast (see the sketch after this block)
    * Unskip the unit tests related to len(cached_x.grad_fn.next_functions) == 1
    Co-authored-by: David Fan <jiafa@microsoft.com>
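The check this commit touches is the autograd-parent guard in apex's amp cached_cast, which verifies that a cached lower-precision copy really was produced from the tensor being cast before reusing it. The sketch below is a simplified, assumed form of that guard; the exact fallback when next_functions has a single entry is an assumption, and the surrounding caching logic is omitted.

```python
def _autograd_parent(cached_x):
    """Return the variable that produced cached_x through the recorded cast.

    Depending on how the cast was recorded, grad_fn.next_functions may hold
    one entry or two, so index defensively instead of assuming [1][0].
    """
    next_functions = cached_x.grad_fn.next_functions
    # Assumed handling: fall back to the first entry when only one exists.
    idx = 1 if len(next_functions) > 1 else 0
    return next_functions[idx][0].variable


def check_cached_cast(x, cached_x):
    # When both tensors are tracked by autograd, make sure x really is the
    # parent of the cached cast; otherwise the cache entry is stale.
    if x.requires_grad and cached_x.requires_grad:
        if _autograd_parent(cached_x) is not x:
            raise RuntimeError(
                "x and cache[x] both require grad, but x is not cache[x]'s parent."
            )
```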
- 23 Aug, 2022 2 commits
  - hubertlu-tw authored
  - hanbao authored
    Co-authored-by: Han Bao <hbao@nvidia.com>
- 22 Aug, 2022 2 commits
  - Thor Johnsen authored
  - hubertlu-tw authored
- 15 Aug, 2022 1 commit
  - Jithun Nair authored
    IFU-master-2022-07-29
- 10 Aug, 2022 1 commit
  - hubertlu-tw authored
- 09 Aug, 2022 7 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
- 08 Aug, 2022 6 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
  - Hubert Lu authored
    * Skip the failing unit tests from the FusedRMSNorm PR (the skip pattern is sketched after this block)
    * Update test_lamb.py
    Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
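Several of the ROCm commits in this log skip known-failing unit tests rather than delete them. The sketch below shows the usual PyTorch-style skip condition: torch.version.hip is set on ROCm builds and None on CUDA builds. The test class, method name, and reason string are illustrative, not taken from the repo.

```python
import unittest

import torch

# torch.version.hip is non-None only on ROCm builds of PyTorch, so this flag
# marks tests that should be skipped on ROCm but still run on CUDA.
IS_ROCM = torch.version.hip is not None


class TestFusedRMSNorm(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "Known failure on ROCm; tracked for a follow-up fix")
    def test_autocast_fused_rmsnorm(self):
        ...  # body of the temporarily skipped test
```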
- 05 Aug, 2022 1 commit
  - Hubert Lu authored
    * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) (a reference formulation is sketched after this block)
    * FusedRMSNorm based on FusedLayerNorm
    * refactor duplicated kernels
    * delete comments
    * delete comments
    * cleanup
    * cleanup
    * cleanup, fixed clobbering forward_affine_mixed_dtypes
    * fix pybind naming and add MixedFused test
    * undo skipping
    * check elementwise_affine
    * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py (Oof, nice catch, thanks)
    * fix and generate docs for FusedRMSNorm (#1285)
    * [FusedRMSNorm doc] document where epsilon is added (#1295)
    * [FusedRMSNorm doc] add epsilon to formula
    * correct
    * better wording
    * Fix some bugs
    * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
    * Fix NaN issues in FusedRMSNorm
    * Update test_fused_layer_norm.py
    * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
    * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
    Co-authored-by: eqy <eddiey@nvidia.com>
    Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
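For reference, RMSNorm ("T5LayerNorm") drops LayerNorm's mean-centering and bias and normalizes by the root-mean-square of the input, with epsilon added inside the square root, which is the point the doc bullets above document. The plain PyTorch sketch below shows the math the fused kernel implements; it is a reference formulation, not the fused CUDA/HIP code itself.

```python
import torch


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Reference RMSNorm over the last dimension.

    Unlike LayerNorm there is no mean subtraction and no bias; epsilon is
    added to the mean of squares before taking the square root.
    """
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
    x_normed = x * torch.rsqrt(mean_sq + eps)
    return weight * x_normed


# Usage: normalize a (batch, seq, hidden) activation with a learnable scale.
hidden = 64
weight = torch.ones(hidden)
y = rms_norm(torch.randn(2, 8, hidden), weight)
```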
- 29 Jul, 2022 5 commits
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
  - hubertlu-tw authored
- 28 Jul, 2022 1 commit
  - Eric Harper authored
    * use _all_gather_base (see the sketch after this block)
    * use _reduce_scatter_base
    * remove torch empty in backward
    * check self.attn_mask_type
    * remove extra arg
    * update get_tensor_shapes logic
    Signed-off-by: ericharper <complex451@gmail.com>
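The first two bullets swap list-based collectives for the flat "base" collectives in torch.distributed, which gather into or reduce-scatter from one contiguous tensor instead of a Python list of per-rank chunks. A small sketch of the call pattern follows; the tensor sizes and process-group setup are illustrative, and newer PyTorch releases expose the same operations as all_gather_into_tensor / reduce_scatter_tensor.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every rank.
world_size = dist.get_world_size()

local = torch.randn(1024, device="cuda")

# _all_gather_base writes every rank's shard into one contiguous buffer,
# avoiding the list-of-tensors copies that dist.all_gather performs.
gathered = torch.empty(world_size * local.numel(), device="cuda")
dist._all_gather_base(gathered, local)

# _reduce_scatter_base is the inverse: reduce a full-size buffer across ranks
# and keep only this rank's contiguous shard.
full = torch.randn(world_size * 1024, device="cuda")
shard = torch.empty(1024, device="cuda")
dist._reduce_scatter_base(shard, full)
```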
- 26 Jul, 2022 1 commit
  - Tim Moon authored
    * Improvements in distributed Adam optimizer for Megatron
      Add option to allocate gradient buckets out of one large buffer (see the sketch after this block).
      Add option to initialize params in user-provided order.
      Perform communication when saving optimizer state.
      Support param sync with any dtype.
    * Style fixes in distributed Adam helper classes
      Review suggestions from @crcrpar
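"Gradient buckets out of one large buffer" refers to carving per-bucket gradient storage out of a single preallocated flat tensor, so buckets are contiguous views of the same storage and can be communicated without extra gather/copy steps. The sketch below illustrates only that allocation pattern; the sizes and bucket layout are assumptions, not the distributed Adam implementation itself.

```python
import torch

bucket_sizes = [1 << 20, 1 << 20, 1 << 19]  # elements per gradient bucket (illustrative)

# One large backing buffer for all buckets...
flat_buffer = torch.zeros(sum(bucket_sizes), dtype=torch.float16, device="cuda")

# ...and per-bucket views carved out of it with narrow(), so each bucket is a
# contiguous slice of the same storage.
buckets, offset = [], 0
for size in bucket_sizes:
    buckets.append(flat_buffer.narrow(0, offset, size))
    offset += size

# Gradients accumulated into buckets[i] land directly in flat_buffer, so a
# collective over flat_buffer (or a slice of it) covers them with no repacking.
```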