- 18 Sep, 2023 1 commit
flyingdown authored

- 12 Jun, 2023 1 commit

flyingdown authored
2. Add the environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16r is used
3. Add DCU version information; rename the whl package; update the installation steps in the README
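For reference, opting in via the new switch would look roughly like this (a minimal sketch; how the extension parses the value is an assumption, not the committed behavior):

```python
import os

# Hedged sketch: APEX_ROCBLAS_GEMM_ALLOW_HALF is described as a switch for the
# fp16 (fp16r) rocBLAS GEMM path. The "1"-means-enabled convention below is an
# assumption; a getenv-style opt-in like this is the common pattern.
os.environ["APEX_ROCBLAS_GEMM_ALLOW_HALF"] = "1"  # opt in before the extension loads

def rocblas_half_gemm_allowed() -> bool:
    return os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0") == "1"

assert rocblas_half_gemm_allowed()
```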

- 23 Apr, 2023 1 commit

luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS
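A usage sketch for the new optimizer (hypothetical: the import path and constructor arguments are assumptions; only the FusedLARS name, the nesterov flow, and NHWC support come from the commit message):

```python
import torch
# Hypothetical sketch -- import path and constructor signature are assumed,
# not the committed API.
from apex.optimizers import FusedLARS

model = torch.nn.Conv2d(3, 64, 3).cuda().to(memory_format=torch.channels_last)  # NHWC
opt = FusedLARS(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

x = torch.randn(8, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
model(x).sum().backward()
opt.step()
opt.zero_grad()
```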

- 23 Mar, 2023 1 commit

luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS

- 11 Nov, 2022 1 commit

flyingdown authored

- 08 Nov, 2022 2 commits

flyingdown authored

flyingdown authored

- 21 Sep, 2022 1 commit

Hubert Lu authored
* Make index_mul_2d extension backward compatible for Atomic header include
* Typo

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

- 19 Sep, 2022 1 commit

Hubert Lu authored
* Remove redundant imports and enable ninja for MHA extension
* Remove redundant CUDAExtension imports

- 08 Sep, 2022 1 commit

Hubert Lu authored
* Enable --transducer extension for ROCm
* Enable --transducer unit tests for ROCm
* Skip some failing tests in test_transducer_joint.py
* Skip test_transducer_joint_pack for transducer extension
* Keep transducer extension CUDA-compatible
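The standard pattern for skipping a test on ROCm while keeping it enabled on CUDA looks like this (a sketch; the exact decorator used in test_transducer_joint.py is an assumption):

```python
import unittest
import torch

IS_ROCM = torch.version.hip is not None  # True on a HIP/ROCm build of PyTorch

class TestTransducerJoint(unittest.TestCase):
    # Test name taken from the commit message; the real test body does more.
    @unittest.skipIf(IS_ROCM, "test_transducer_joint_pack is skipped on ROCm")
    def test_transducer_joint_pack(self):
        self.assertTrue(torch.cuda.is_available())

if __name__ == "__main__":
    unittest.main()
```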

- 07 Sep, 2022 1 commit

hubertlu-tw authored

- 23 Aug, 2022 2 commits

hubertlu-tw authored

hanbao authored
Co-authored-by: Han Bao <hbao@nvidia.com>

- 22 Aug, 2022 1 commit

hubertlu-tw authored

- 07 Jul, 2022 1 commit

Masaki Kozuki authored
* remove pyprof
* remove reparameterization
* remove pyprof test
* clean up

- 31 May, 2022 1 commit

Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compat for new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent lengthy if-statement

- 21 Apr, 2022 1 commit

Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
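A runtime version guard of the kind listed here typically compares the installed CUDA version once and skips tests accordingly (a sketch; the guarded feature and the 11.0 threshold are assumptions):

```python
import unittest
import torch
from packaging.version import parse

# Hypothetical threshold: gate a code path behind a minimum CUDA runtime version.
REQUIRED_CUDA = parse("11.0")
cuda_version = parse(torch.version.cuda) if torch.version.cuda else None

def cuda_is_new_enough() -> bool:
    return cuda_version is not None and cuda_version >= REQUIRED_CUDA

class TestGuarded(unittest.TestCase):
    @unittest.skipUnless(cuda_is_new_enough(), "requires CUDA >= 11.0")
    def test_feature(self):
        self.assertTrue(torch.cuda.is_available())
```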

- 19 Apr, 2022 1 commit

Masaki Kozuki authored
* bump version
* add guard
* fix the cond

- 15 Apr, 2022 1 commit

Hubert Lu authored
* Add setup_simple.py for debugging the compiling issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve filename collision of *.cpp files with to-hipify code and *.cu files
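A setup_simple.py for this purpose would plausibly build just the one extension so compiler errors can be read in isolation (a sketch of what such a script might contain; the source paths and flags are assumptions):

```python
# setup_simple.py -- minimal sketch: build only the scaled_masked_softmax_cuda
# extension so its compile errors can be inspected in isolation.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="scaled_masked_softmax_cuda",
    ext_modules=[
        CUDAExtension(
            name="scaled_masked_softmax_cuda",
            # Source paths assumed; apex keeps these under csrc/.
            sources=[
                "csrc/megatron/scaled_masked_softmax.cpp",
                "csrc/megatron/scaled_masked_softmax_cuda.cu",
            ],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```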

- 13 Apr, 2022 1 commit

Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? The unnecessary `#include "layer_norm.cuh"` was apparently the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`
* clean up layer norm
* Fix some bugs

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

- 06 Apr, 2022 1 commit

Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs

- 05 Apr, 2022 2 commits

Thor Johnsen authored

Thor Johnsen authored

- 30 Mar, 2022 1 commit

Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion. The following modules are enabled using cuDNN runtime fusion: 1) Conv-Bias-ReLU (+backward), 2) Conv-Bias (+backward), 3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest: remove redundant dtype casts; simulate the usage in the unittest by using torch.cuda.amp.autocast
* Fixed save_for_backward

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: root <root@luna-0277.selene.nvidia.com>
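For orientation, this is the unfused reference computation the cuDNN runtime-fused modules replace, run under autocast as in the unittest (a sketch; it does not call the fused apex modules themselves):

```python
import torch
import torch.nn.functional as F

# Reference (unfused) computation that the cuDNN runtime-fused kernels replace;
# useful as a numerical baseline in a unit test.
x = torch.randn(8, 32, 16, 16, device="cuda").to(memory_format=torch.channels_last)
w = torch.randn(64, 32, 3, 3, device="cuda").to(memory_format=torch.channels_last)
b = torch.randn(64, device="cuda")

with torch.cuda.amp.autocast():               # mirror the unittest's autocast usage
    y = F.relu(F.conv2d(x, w, b, padding=1))  # Conv -> Bias -> ReLU, unfused
```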

- 25 Mar, 2022 1 commit

Thor Johnsen authored

- 24 Mar, 2022 1 commit

Masaki Kozuki authored
Take-over of #1097
* Add fast CUDA focal loss implementation
* Enable fast math for CUDA focal loss
* Correct typo
* replace deprecated macros
* TORCH_CUDA_CHECK -> AT_CUDA_CHECK: the former is defined in torch/csrc/profiler/cuda.cpp, so it is not usually available; the latter is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK
* add test
* clean up
* guard for torchvision

Co-authored-by: Wil Kong <alpha0422@gmail.com>
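The "guard for torchvision" item is the usual optional-dependency guard (a sketch; where the guard actually lives in the test code is an assumption):

```python
import unittest

# Guard an optional torchvision dependency so the focal-loss test suite still
# imports cleanly when torchvision is absent.
try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False

@unittest.skipUnless(HAS_TORCHVISION, "test requires torchvision")
class TestFocalLoss(unittest.TestCase):
    def test_import(self):
        self.assertTrue(HAS_TORCHVISION)
```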

- 23 Mar, 2022 1 commit

Thor Johnsen authored

- 11 Mar, 2022 1 commit

Pruthvi Madugundu authored

- 27 Feb, 2022 1 commit

Masaki Kozuki authored

- 26 Feb, 2022 1 commit

Masaki Kozuki authored
* fuse grad accumulation w/ weight grad
* fp32 training path
* not using *args, **kwargs
* backward: moved the tensor dimension conversion
* move files to csrc/megatron
* fix fp32 path
* fix typo
* add to in order to select the correct custom extension
* fix typo
* comment on import guard
* update test: enable gradient_accumulation_fusion
* 86
* remove redundant call of `test_column_parallel_linear`

Co-authored-by: Sangkug Lym <slym@nvidia.com>

- 10 Feb, 2022 1 commit

Masaki Kozuki authored

- 01 Feb, 2022 1 commit

ChongyuNVIDIA authored
* Add the permutation related support as the extension for asp lib.
* [Fix] Track the permutation sequence for progressive channel swap strategy.
* Fix the corner case that one layer is not sparse, but needs to apply permutation due to its siblings.
* Fix the deprecated functions in ASP unit tests.
* Fix the sparsity info typo in ASP lib.
* [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
* Update the README.md with identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into ASP lib.
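Setting the identical random seed on every GPU/process, per the [Enhancement] item above, usually looks like this (a sketch; ASP's actual entry point for this is an assumption):

```python
import random
import numpy as np
import torch

def set_identical_seed(seed: int = 1) -> None:
    # Same seed everywhere so the permutation search explores the same
    # candidate sequence on every rank and all GPUs agree on the result.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_identical_seed(1)
```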

- 28 Jan, 2022 1 commit

Jithun Nair authored

- 19 Jan, 2022 1 commit

Masaki Kozuki authored

- 13 Jan, 2022 1 commit

Shintaro Iwasaki authored

- 16 Dec, 2021 1 commit

Masaki Kozuki authored

- 15 Dec, 2021 1 commit

Masaki Kozuki authored
* apply formatter & remove duplicate func def
* DRY the CUDA_HOME None check
* `--threads 4`
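The last two items concern the build script: one reusable CUDA_HOME check instead of repeated ones, and nvcc's parallel-compilation flag (a sketch; the helper name is hypothetical, and `--threads` requires nvcc >= 11.2):

```python
# Sketch of the setup.py pattern implied by these items; `check_cuda_home` is a
# hypothetical helper name used for illustration.
from torch.utils.cpp_extension import CUDA_HOME

def check_cuda_home() -> None:
    # Single, reusable (DRY) guard instead of repeating the None check per extension.
    if CUDA_HOME is None:
        raise RuntimeError("CUDA_HOME not set; install the CUDA toolkit first.")

check_cuda_home()
# nvcc >= 11.2 can compile a translation unit with multiple threads:
nvcc_flags = ["-O3", "--threads", "4"]
```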

- 14 Dec, 2021 1 commit

Masaki Kozuki authored
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? The unnecessary `#include "layer_norm.cuh"` was apparently the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`
* clean up layer norm

- 09 Dec, 2021 2 commits

Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
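A plausible way to drive the new optimizer (a sketch; the class name follows the commit message, but the import path and constructor arguments are assumptions):

```python
import torch
# Import path and signature assumed for illustration; see apex.optimizers for
# the class this commit actually adds.
from apex.optimizers import FusedMixedPrecisionLamb

model = torch.nn.Linear(1024, 1024).cuda().half()
opt = FusedMixedPrecisionLamb(model.parameters(), lr=1e-3, weight_decay=0.01)

out = model(torch.randn(32, 1024, device="cuda", dtype=torch.half))
out.float().sum().backward()
opt.step()
opt.zero_grad()
```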

Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.