- 08 Aug, 2022 1 commit
-
-
hubertlu-tw authored
-
- 29 Jul, 2022 1 commit
-
-
hubertlu-tw authored
-
- 26 Jul, 2022 1 commit
-
-
Tim Moon authored
* Improvements in distributed Adam optimizer for Megatron
  Add option to allocate gradient buckets out of one large buffer. Add option to initialize params in user-provided order. Perform communication when saving optimizer state. Support param sync with any dtype.
* Style fixes in distributed Adam helper classes
  Review suggestions from @crcrpar
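A minimal usage sketch for the optimizer described above. The import path matches apex's contrib optimizers, but the `contiguous_grad_buffer` keyword is an assumption inferred from the "one large buffer" option in the commit message, not a confirmed signature.
```python
# Sketch: driving the distributed Adam optimizer from apex.contrib.
# `contiguous_grad_buffer` is an assumed flag name for the single large gradient buffer.
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = DistributedFusedAdam(
    model.parameters(),
    lr=1e-4,
    contiguous_grad_buffer=True,  # assumed keyword; see lead-in note
)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```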
-
- 21 Jul, 2022 1 commit
-
-
Thor Johnsen authored
-
- 14 Jul, 2022 1 commit
-
-
Masaki Kozuki authored
* follow the current signature
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* call .backward on outputs
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* update the other caller of _softmax_backward_data
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
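The signature being followed here is PyTorch's private `torch._softmax_backward_data`, whose last argument moved from a tensor to a dtype in newer releases. A hedged sketch of a version-tolerant caller; treating 1.11 as the cutover version is an assumption.
```python
# Sketch: tolerate both signatures of the private torch._softmax_backward_data.
import torch
from packaging.version import parse as parse_version

_NEW_SIGNATURE = parse_version(torch.__version__.split("+")[0]) >= parse_version("1.11")

def softmax_backward(grad_output, output, dim, inputs):
    if _NEW_SIGNATURE:
        # Newer releases take the input dtype as the last argument.
        return torch._softmax_backward_data(grad_output, output, dim, inputs.dtype)
    # Older releases take the input tensor itself.
    return torch._softmax_backward_data(grad_output, output, dim, inputs)
```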
-
- 05 Jul, 2022 1 commit
-
-
Tim Moon authored
* Add features to distributed Adam for Megatron support
  Support gradient clipping, gradient scaling, FP32 grad accumulation, and multiple dtypes and devices.
* Restore closure arg to distributed Adam
  Review suggestion from @crcrpar
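Since the commit adds gradient scaling, a training step would follow the standard PyTorch scaler pattern shown below; how the distributed optimizer consumes the scaler internally is not spelled out in the commit, so this is the generic pattern only.
```python
# Sketch: mixed-precision step with gradient scaling around an optimizer,
# as the grad-scaling support above would be exercised in a training loop.
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, batch, target, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch), target)
    scaler.scale(loss).backward()  # FP32 grad accumulation happens inside the optimizer
    scaler.step(optimizer)
    scaler.update()
    return loss
```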
-
- 23 Jun, 2022 1 commit
-
-
Tim Moon authored
* Increase default bucket size in distributed Adam
* Move distributed Adam unit test to contrib tests
  Integrate into unit testing framework
* Tweak hyperparameters for dist Adam optimizer test
  Improves numerical stability so we can keep tight tolerances. Adopting suggestions from @crcrpar.
* Use distributed test infrastructure in distributed Adam unit test
  Suggestion from @crcrpar.
-
- 22 Jun, 2022 1 commit
-
-
Tim Moon authored
* Gradient clipping routine with fused kernels
  Identical API to PyTorch. Falls back to the PyTorch implementation when not computing the L2 norm.
* Add unit test for gradient clipping
* Add fp16 case to gradient clipping unit test
* Tweaks to grad clipping unit test
  Review suggestions from @crcrpar
* Debug gradient clipping tests
  When checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error.
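Because the fused routine keeps the PyTorch API, its call site should mirror `torch.nn.utils.clip_grad_norm_`; the apex.contrib import path below is an assumption, with a fallback to the reference implementation.
```python
# Sketch: fused gradient clipping with the same API as torch.nn.utils.clip_grad_norm_.
import torch

try:
    from apex.contrib.clip_grad import clip_grad_norm_  # assumed module path
except ImportError:
    from torch.nn.utils import clip_grad_norm_  # reference implementation

model = torch.nn.Linear(16, 16).cuda()
model(torch.randn(4, 16, device="cuda")).sum().backward()

# norm_type=2.0 hits the fused L2 fast path described above;
# other norm types fall back to the PyTorch implementation.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)
```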
-
- 16 Jun, 2022 1 commit
-
-
Kevin Stephano authored
Remove legacy fuser usage from multihead attention in contrib in favor of the default, which should be nvfuser. Modify test scripts to activate fusion. (#1403)
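Test scripts of this era typically activate fusion with the TorchScript fuser context manager; a sketch with an illustrative scripted function, not the contrib multihead-attention module itself.
```python
# Sketch: selecting nvfuser ("fuser2") for a TorchScript function.
import torch

@torch.jit.script
def bias_gelu(x, bias):
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y / 1.41421356237))

x = torch.randn(128, 1024, device="cuda", requires_grad=True)
bias = torch.randn(1024, device="cuda", requires_grad=True)

with torch.jit.fuser("fuser2"):  # "fuser2" selects nvfuser
    for _ in range(3):           # a few warm-up iterations let the fuser kick in
        out = bias_gelu(x, bias)
        out.sum().backward()
```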
-
- 14 Jun, 2022 1 commit
-
-
Tim Moon authored
Adjust test options to have tighter tolerances.
-
- 13 Jun, 2022 1 commit
-
-
Tim Moon authored
-
- 31 May, 2022 1 commit
-
-
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compatible with the new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent a lengthy if-statement
-
- 29 Apr, 2022 1 commit
-
-
yjk21 authored
-
- 21 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
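A generic sketch of a runtime version guard used to skip tests, in the spirit of the bullets above; the specific version threshold and guarded feature are assumptions, not the exact guard from the commit.
```python
# Sketch: skip tests at runtime based on the installed PyTorch version.
import unittest
import torch
from packaging.version import parse as parse_version

TORCH_AT_LEAST_1_12 = parse_version(torch.__version__.split("+")[0]) >= parse_version("1.12")

@unittest.skipUnless(TORCH_AT_LEAST_1_12, "requires PyTorch >= 1.12")
class GuardedTest(unittest.TestCase):
    def test_feature(self):
        self.assertTrue(TORCH_AT_LEAST_1_12)  # placeholder body
```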
-
- 19 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
-
- 14 Apr, 2022 2 commits
-
-
mahathis authored
* Add support for the memory format API (torch.channels_last) in GBN
  Group Batch Norm (GBN) is an NHWC operation: it assumes that the underlying memory format of an input tensor is NHWC, and it originally did not support PyTorch's memory_format API. To support PyTorch's memory_format API, i.e., .to(memory_format=...) or .contiguous(memory_format=...), we add a torch_channels_last flag to indicate whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag allows GBN to handle the memory formats of input tensors properly.
  An example of using memory_format in GBN:
  """
  from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
  GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
  """
  GBN handles the following cases:
  1. torch_channels_last=True and the input tensor's memory_format is torch.channels_last: GBN generates a torch.channels_last output tensor.
  2. torch_channels_last=True and the input tensor's memory_format is torch.contiguous_format: GBN converts the input tensor to torch.channels_last and generates a torch.channels_last output tensor.
  3. torch_channels_last=False and the input tensor's memory_format is torch.contiguous_format: GBN generates a torch.contiguous_format output tensor.
* Add GBN unit tests for channels_last memory format
Co-authored-by: hubertlu-tw <hubertlu@amd.com>
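Extending the example quoted in the commit message, a hedged end-to-end sketch of case 1 (input already converted to channels_last); tensor shapes, the half-precision input, and the .cuda() placement are illustrative assumptions.
```python
# Sketch: feed a channels_last input to the contrib Group Batch Norm.
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

planes = 64
gbn = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True).cuda()

x = torch.randn(8, planes, 32, 32, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)  # case 1 in the list above

y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)
```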
-
Thor Johnsen authored
-
- 13 Apr, 2022 2 commits
-
-
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* clean up layer norm
* Fix some bugs
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
Thor Johnsen authored
-
- 08 Apr, 2022 3 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 06 Apr, 2022 1 commit
-
-
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs
-
- 05 Apr, 2022 2 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 03 Apr, 2022 1 commit
-
-
Thor Johnsen authored
-
- 02 Apr, 2022 4 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 01 Apr, 2022 3 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 31 Mar, 2022 3 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 30 Mar, 2022 2 commits
-
-
Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion
  The following modules are enabled using cuDNN runtime fusion:
  1) Conv-Bias-ReLU (+backward)
  2) Conv-Bias (+backward)
  3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest
  - Remove redundant dtype casts
  - Simulate the usage in the unittest by using torch.cuda.amp.autocast
  Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
* Fixed save_for_backward
  Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
  Co-authored-by: root <root@luna-0277.selene.nvidia.com>
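A sketch of the reference (unfused) Conv-Bias-ReLU path under autocast, i.e. the computation the cuDNN runtime-fused modules above replace; the fused apex.contrib entry points themselves are not called here because their exact signatures are not spelled out in the commit message.
```python
# Sketch: reference Conv-Bias-ReLU under autocast on channels_last tensors.
import torch
import torch.nn.functional as F

x = torch.randn(32, 64, 56, 56, device="cuda").to(memory_format=torch.channels_last)
weight = torch.randn(128, 64, 3, 3, device="cuda").to(memory_format=torch.channels_last)
bias = torch.randn(128, device="cuda")

with torch.cuda.amp.autocast():
    # A fused module computes the same result without materializing intermediates.
    y_ref = F.relu(F.conv2d(x, weight, bias, stride=1, padding=1))
```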
-
Thor Johnsen authored
-
- 29 Mar, 2022 2 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-