- 19 Sep, 2023 1 commit
-
-
root authored
-
- 18 Sep, 2023 1 commit
-
-
flyingdown authored
-
- 06 Sep, 2023 1 commit
-
-
Pruthvi Madugundu authored
This reverts commit 8fc9b21f.
-
- 11 Aug, 2023 1 commit
-
-
Pruthvi Madugundu authored
-
- 12 Jun, 2023 1 commit
-
-
flyingdown authored
2. Add environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16r is used. 3. Add DCU version information; rename the whl package; update the installation steps in the README.
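A hedged sketch of how such an environment toggle is typically consumed; the accepted values ("0"/"1") and the Python-side helper are assumptions for illustration, since the actual switch lives in the ROCm GEMM code path:

```python
import os

# Assumption: "1" allows the fp16 (fp16r) rocBLAS GEMM path, anything else keeps it
# disabled. Set the variable before the extension is loaded/used.
os.environ.setdefault("APEX_ROCBLAS_GEMM_ALLOW_HALF", "1")

def gemm_allow_half() -> bool:
    # Hypothetical helper mirroring how the flag could be read on the Python side.
    return os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0") == "1"

print("fp16r GEMM allowed:", gemm_allow_half())
```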
-
- 23 Apr, 2023 5 commits
-
-
Pruthvi Madugundu authored
-
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578), with follow-up nit fixes
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit `b005ec62b9`
* Use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
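A minimal sketch of the API migrations listed above (illustrative values, not code from the commits themselves):

```python
import torch
from collections.abc import Sequence  # replaces the removed torch._six container aliases

# torch.empty() replaces the legacy torch.Tensor(...) constructor for
# uninitialized storage and must be given explicit sizes.
buf = torch.empty(2, 3, dtype=torch.float16)

# torch.tensor() is the right call when concrete initializer values exist.
mask = torch.tensor([1, 0, 1, 1], dtype=torch.bool)

assert isinstance([1, 2, 3], Sequence)  # std collections.abc instead of torch._six
```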
-
luise.chen authored
* GroupBN: reduce buffering to better hide calculations in some loops of length OUTER_LOOPS
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels to improve resnet50
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels, ~10% E2E improvement on resnet50
-
Pruthvi Madugundu authored
* Update register keyword handling for C++17: the 'register' storage-class keyword is removed in C++17, so it is kept active only for C++14 and lower.
* Updates to the code
-
hubertlu-tw authored
-
- 30 Mar, 2023 1 commit
-
-
Pruthvi Madugundu authored
-
- 01 Mar, 2023 1 commit
-
-
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578), with follow-up nit fixes
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit `b005ec62b9`
* Use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 13 Feb, 2023 1 commit
-
-
luise.chen authored
* GroupBN: reduce buffering to better hide calculations in some loops of length OUTER_LOOPS
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels to improve resnet50
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels, ~10% E2E improvement on resnet50
-
- 20 Dec, 2022 1 commit
-
-
Pruthvi Madugundu authored
* Update register keyword handling for C++17: the 'register' storage-class keyword is removed in C++17, so it is kept active only for C++14 and lower.
* Updates to the code
-
- 09 Dec, 2022 1 commit
-
-
hubertlu-tw authored
-
- 14 Nov, 2022 1 commit
-
-
flyingdown authored
-
- 08 Nov, 2022 1 commit
-
-
flyingdown authored
-
- 21 Sep, 2022 1 commit
-
-
Hubert Lu authored
* Make the index_mul_2d extension backward compatible with respect to the Atomic header include
* Fix typo
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
-
- 08 Sep, 2022 1 commit
-
-
Hubert Lu authored
* Enable the --transducer extension for ROCm
* Enable the --transducer unit tests for ROCm
* Skip some failing tests in test_transducer_joint.py
* Skip test_transducer_joint_pack for the transducer extension
* Keep the transducer extension CUDA-compatible
-
- 23 Aug, 2022 2 commits
-
-
hubertlu-tw authored
-
hanbao authored
Co-authored-by: Han Bao <hbao@nvidia.com>
-
- 22 Aug, 2022 2 commits
-
-
Thor Johnsen authored
-
hubertlu-tw authored
-
- 08 Aug, 2022 1 commit
-
-
hubertlu-tw authored
-
- 29 Jul, 2022 1 commit
-
-
hubertlu-tw authored
-
- 26 Jul, 2022 1 commit
-
-
Tim Moon authored
* Improvements in the distributed Adam optimizer for Megatron: add an option to allocate gradient buckets out of one large buffer, add an option to initialize params in user-provided order, perform communication when saving optimizer state, and support param sync with any dtype.
* Style fixes in distributed Adam helper classes (review suggestions from @crcrpar)
-
- 21 Jul, 2022 1 commit
-
-
Thor Johnsen authored
-
- 14 Jul, 2022 1 commit
-
-
Masaki Kozuki authored
* Follow the current signature
* Call .backward on outputs
* Update the other caller of _softmax_backward_data
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 05 Jul, 2022 1 commit
-
-
Tim Moon authored
* Add features to distributed Adam for Megatron support: gradient clipping, gradient scaling, FP32 grad accumulation, and multiple dtypes and devices.
* Restore the closure arg to distributed Adam (review suggestion from @crcrpar)
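A hedged sketch of driving the contrib distributed Adam optimizer with the closure-style step() mentioned above; the constructor keywords beyond the parameter list are standard Adam arguments, and the torch.distributed process-group setup is assumed to be done elsewhere:

```python
import torch
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

# Assumes torch.distributed has already been initialized (e.g. via torchrun).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = DistributedFusedAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))

def closure():
    optimizer.zero_grad()
    out = model(torch.randn(8, 1024, device="cuda"))
    loss = out.float().pow(2).mean()
    loss.backward()
    return loss

loss = optimizer.step(closure)  # closure-style step, as restored by the commit above
```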
-
- 23 Jun, 2022 1 commit
-
-
Tim Moon authored
* Increase the default bucket size in distributed Adam
* Move the distributed Adam unit test to the contrib tests and integrate it into the unit testing framework
* Tweak hyperparameters for the dist Adam optimizer test to improve numerical stability so tight tolerances can be kept (suggestions from @crcrpar)
* Use the distributed test infrastructure in the distributed Adam unit test (suggestion from @crcrpar)
-
- 22 Jun, 2022 1 commit
-
-
Tim Moon authored
* Gradient clipping routine with fused kernels: identical API to PyTorch; falls back to the PyTorch impl when not computing the L2 norm.
* Add a unit test for gradient clipping, plus an fp16 case
* Tweaks to the grad clipping unit test (review suggestions from @crcrpar)
* Debug gradient clipping tests: when checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error.
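A hedged usage sketch of the fused gradient-clipping routine; the import path is an assumption (check apex/contrib for the exact module), while the call signature mirrors torch.nn.utils.clip_grad_norm_ as the commit states:

```python
import torch
from apex.contrib.clip_grad import clip_grad_norm_  # assumed import path

model = torch.nn.Linear(512, 512).cuda()
model(torch.randn(4, 512, device="cuda")).sum().backward()

# Same API as torch.nn.utils.clip_grad_norm_: with the L2 norm the fused
# kernels are used; other norm types fall back to the PyTorch implementation.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)
print(total_norm)
```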
-
- 16 Jun, 2022 1 commit
-
-
Kevin Stephano authored
Remove legacy fuser usage from multihead attention in contrib in favor of the default, which should be nvfuser. Modify test scripts to activate fusion. (#1403)
-
- 14 Jun, 2022 1 commit
-
-
Tim Moon authored
Adjust test options to have tighter tolerances.
-
- 13 Jun, 2022 1 commit
-
-
Tim Moon authored
-
- 31 May, 2022 1 commit
-
-
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compatible with the new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent a lengthy if-statement
-
- 29 Apr, 2022 1 commit
-
-
yjk21 authored
-
- 21 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
-
- 19 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
-
- 14 Apr, 2022 2 commits
-
-
mahathis authored
* Add support for the memory format API (torch.channels_last) in GBN.
  Group Batch Norm (GBN) is an NHWC operation: it assumes that the underlying memory format of an input tensor is NHWC, and it originally did not support PyTorch's memory_format API. To support PyTorch's memory_format API, i.e. .to(memory_format=...) or .contiguous(memory_format=...), we add the torch_channels_last flag to indicate whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag allows GBN to handle the memory formats of input tensors properly.
  An example of using memory_format in GBN:
    """
    from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
    GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
    """
  The cases that GBN handles are as follows:
  1. torch_channels_last=True and the input tensor's memory_format is torch.channels_last: GBN generates a torch.channels_last output tensor.
  2. torch_channels_last=True and the input tensor's memory_format is torch.contiguous_format: GBN converts the input tensor to torch.channels_last and generates a torch.channels_last output tensor.
  3. torch_channels_last=False and the input tensor's memory_format is torch.contiguous_format: GBN generates a torch.contiguous_format output tensor.
* Add GBN unit tests for the channels_last memory format
Co-authored-by: hubertlu-tw <hubertlu@amd.com>
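A short runnable sketch of case 1 from the list above; the half-precision dtype, tensor sizes, and device placement are assumptions, while the constructor arguments come from the commit message:

```python
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

planes = 64
# torch_channels_last=True signals that the workload uses PyTorch's memory_format API.
gbn = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True).cuda()

# Case 1: input already channels_last -> output stays channels_last.
x = torch.randn(8, planes, 56, 56, device="cuda", dtype=torch.half)
x = x.contiguous(memory_format=torch.channels_last)
y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)
```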
-
Thor Johnsen authored
-