- 10 Feb, 2022 1 commit

Masaki Kozuki authored

- 01 Feb, 2022 1 commit

ChongyuNVIDIA authored
* Add the permutation-related support as an extension of the ASP lib.
* [Fix] Track the permutation sequence for the progressive channel swap strategy.
* Fix the corner case where one layer is not sparse but still needs a permutation applied because of its siblings.
* Fix the deprecated functions in the ASP unit tests.
* Fix the sparsity info typo in the ASP lib.
* [Enhancement] Set an identical random seed on all GPUs so that permutation search generates the same results everywhere.
* Update the README.md with the identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into the ASP lib.
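
The ASP workflow these changes extend is driven from Python roughly as below. A minimal sketch, assuming the documented `ASP.prune_trained_model` entry point; the permutation search described above runs inside the mask-computation step, and no keyword controlling it is shown because its exact name is not given here.

```python
import torch
from apex.contrib.sparsity import ASP

# Identical seed on every GPU/rank, as the commit notes, so the permutation
# search arrives at the same result everywhere.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Compute 2:4 sparse masks for a trained model. Siblings of a non-sparse layer
# may still be permuted, which is the corner case mentioned above.
ASP.prune_trained_model(model, optimizer)

# ... continue with sparse fine-tuning as usual ...
```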
- 19 Jan, 2022 1 commit

Masaki Kozuki authored

- 13 Jan, 2022 1 commit

Shintaro Iwasaki authored

- 16 Dec, 2021 1 commit

Masaki Kozuki authored

- 15 Dec, 2021 1 commit

Masaki Kozuki authored
* Apply the formatter and remove a duplicate function definition.
* DRY the CUDA_HOME None check.
* `--threads 4`
- 14 Dec, 2021 1 commit

Masaki Kozuki authored
* Merge the .so files.
* ODR.
* Fix the build.
* Update imports.
* Apply psf/black with a max line length of 120.
* Update.
* Fix.
* Update.
* Build fixed again, but an undefined symbol remains.
* Fix 2; the layer norm gradient is still undefined.
* Remove unused .cpp files.
* Without layer_norm.cuh, the import works.
* `import fast_multihead_attn` works... but why? The unnecessary `#include "layer_norm.cuh"` was the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`.
* Clean up layer norm.
- 09 Dec, 2021 1 commit

Kevin Stephano authored
* Add the fused mixed-precision LAMB optimizer.
* Fix device usage in the constructor.
* Fix sending param_group tensor state to the device.
* Remove an unneeded device set.
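
For context, a fused LAMB optimizer is used as a drop-in replacement for a stock PyTorch optimizer. The sketch below uses `apex.optimizers.FusedLAMB`, which follows this pattern; the mixed-precision variant added in this commit is assumed to live under `apex.contrib` with a similar drop-in interface, and its exact module path is not shown here.

```python
import torch
from apex.optimizers import FusedLAMB

model = torch.nn.Linear(1024, 1024).cuda()
# Drop-in replacement for torch.optim-style optimizers; the update itself runs
# as a fused multi-tensor kernel on the GPU.
optimizer = FusedLAMB(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```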
- 27 Oct, 2021 1 commit

Masaki Kozuki authored
* Persistent LayerNorm: multi-CTA rewrite.
* autocast support.
Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>
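
A minimal sketch of exercising a fused LayerNorm under autocast, the combination this commit touches. The module path `apex.contrib.layer_norm.FastLayerNorm` is assumed to be the one backed by the persistent kernels; `apex.normalization.FusedLayerNorm` exposes the same `nn.LayerNorm`-like interface if the contrib build is unavailable.

```python
import torch
from apex.contrib.layer_norm import FastLayerNorm  # assumed module path for the persistent kernel

ln = FastLayerNorm(1024).cuda()                    # nn.LayerNorm-like: normalized over the last dim
x = torch.randn(8, 128, 1024, device="cuda")

with torch.cuda.amp.autocast():                    # this commit adds autocast support
    y = ln(x)

y.float().sum().backward()
```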
- 02 Oct, 2021 1 commit

Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
- 08 Sep, 2021 1 commit

Masaki Kozuki authored
- Pass include directories via `CUDAExtension`'s `include_dirs` argument.
- Remove `-I/path/to/dir` arguments from `extra_compile_args`.
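
A minimal `setup.py` sketch of the pattern this change moves to: include paths go through `CUDAExtension`'s `include_dirs` instead of being spliced into `extra_compile_args` as raw `-I` flags. The extension name and source files below are placeholders.

```python
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

csrc = os.path.join(os.path.dirname(os.path.abspath(__file__)), "csrc")

setup(
    name="example_ext",
    ext_modules=[
        CUDAExtension(
            name="example_ext",
            sources=[os.path.join(csrc, "example.cpp"),
                     os.path.join(csrc, "example_kernel.cu")],
            include_dirs=[csrc],            # instead of "-I" + csrc inside the flag lists
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": ["-O3", "--use_fast_math"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```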
- 01 Sep, 2021 2 commits

Burc Eryilmaz authored
* Fuse norm into scale.
* Add fused norm into DLAMB.
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Burc Eryilmaz authored
* Support for a fused dense layer with cublasLt, with fusion in both fprop and bprop.
* Fix a typo causing a syntax error.
* Add a fused GEMM+GELU+GEMM module.
* Fix a typo in the workspace size.
* Update the cuBLAS check for 11600.
* Add tests for the fused dense layer.
* Fix the CUDA 10.x path.
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
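
A rough usage sketch of the fused dense layers described above, treated as drop-in replacements for `nn.Linear` and a Linear+GELU+Linear stack. The module path `apex.fused_dense` and the class names below are assumptions inferred from the commit message, not a verified API.

```python
import torch
from apex.fused_dense import FusedDense, FusedDenseGeluDense  # assumed names

x = torch.randn(16, 1024, device="cuda", dtype=torch.half)

dense = FusedDense(1024, 4096).cuda().half()                  # fused GEMM + bias (cublasLt)
mlp = FusedDenseGeluDense(1024, 4096, 1024).cuda().half()     # fused GEMM + GELU + GEMM (assumed signature)

y = dense(x)
z = mlp(x)
z.float().sum().backward()   # the bprop-side fusion is exercised on the backward pass
```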
- 17 Jul, 2021 2 commits

Nan Zheng authored
* Added support for fused ReLU and dropout in the transducer joint.
* Reorganized the code selection path in the transducer joint fwd.
* Added support for fused ReLU+dropout in the transducer joint.
* Vectorize transducer loss backward with fused softmax (#3).
* Nanz/transducer loss (#4):
  * Vectorize transducer loss backward with fused softmax.
  * Added a predicate to avoid a potential IMA.
* Nanz/transducer loss (#5):
  * Vectorize transducer loss backward with fused softmax.
  * Added a predicate to avoid a potential IMA.
  * Added more predicates to avoid IMAs.
* Updated documentation for the newly added features.
* Fixed an error in transducer.py.
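
A construction-only sketch of the transducer extension these commits extend. The class names are assumed to come from `apex.contrib.transducer`; the forward-call conventions and the flags selecting the fused ReLU+dropout path are marked as assumptions in the comments rather than executed.

```python
from apex.contrib.transducer import TransducerJoint, TransducerLoss  # assumed class names

joint = TransducerJoint()   # the fused ReLU+dropout path added here is assumed to be an option of this module
loss_fn = TransducerLoss()  # backward is vectorized with a fused softmax per the commit messages

# Shapes follow the usual RNN-T convention (call signatures below are assumptions):
#   encoder output  f: (B, T, H), predictor output g: (B, U, H)
#   joint(f, g, f_len, g_len)             -> (B, T, U, H) combined activations
#   loss_fn(logits, labels, f_len, y_len) -> per-sequence RNN-T loss
```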
yjk21 authored

- 17 Apr, 2021 1 commit

Deyu Fu authored
* Initial commit adding the fast bottleneck block.
* Sync the cudnn-frontend submodule.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
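
A rough sketch of the fast bottleneck block: a cudnn-frontend-backed, fused replacement for a ResNet-style bottleneck (1x1, 3x3, 1x1 convolutions plus the residual). Only the module path `apex.contrib.bottleneck` is taken from apex; the constructor arguments below are an assumed parameterization, not a verified signature.

```python
import torch
from apex.contrib.bottleneck import Bottleneck  # assumed class name

block = Bottleneck(in_channels=256, bottleneck_channels=64,
                   out_channels=256, stride=1).cuda().half()

# The fused kernels target NHWC layouts, so feed channels_last tensors.
x = torch.randn(8, 256, 56, 56, device="cuda",
                dtype=torch.half).to(memory_format=torch.channels_last)
y = block(x)
```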
- 16 Apr, 2021 1 commit

yjk21 authored

- 24 Mar, 2021 1 commit

Nan Zheng authored
* Initial check-in of the transducer extension.
* Added more comments to help explain the code.
* Corrected minor typos.
* 1. Renamed a variable in the tests to match the extension. 2. Disabled the ninja build option.
- 23 Feb, 2021 1 commit

yjk21 authored

- 01 Dec, 2020 1 commit

Kexin Yu authored
DistributedFusedAdam model parallelism support (Megatron).
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
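
A minimal sketch of driving `DistributedFusedAdam`, which shards optimizer state across data-parallel ranks; the model-parallelism support added here lets it coexist with Megatron-style tensor parallelism. Everything beyond the module path, class name, and the `(params, lr)` arguments is an assumption.

```python
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

# Launched with one rank per GPU (e.g. via torchrun); NCCL backend assumed.
dist.init_process_group(backend="nccl")
model = torch.nn.Linear(1024, 1024).cuda()

optimizer = DistributedFusedAdam(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
```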
- 10 Aug, 2020 1 commit

ptrblck authored
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 01 Aug, 2020 1 commit

ptrblck authored

- 01 Jun, 2020 1 commit

mcarilli authored
Co-authored-by: Michael Carilli <mcarilli@nvidia.com>
- 30 May, 2020 2 commits

Thor Johnsen authored

Thor Johnsen authored

- 29 May, 2020 1 commit

Burc Eryilmaz authored
Fuse dropout and softmax in the backward pass, add bias support to the C++ MHA, add additive mask support, and separate the Q/K/V parameters (#854).
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
- 14 May, 2020 1 commit

Andrew Tulloch authored

- 23 Apr, 2020 1 commit

ptrblck authored
* Add a CUDAGenerator guard.
* Fix generator_flag.
* Add guards for the generator pointer/reference issue.
* Change mutex_ to mutex().
* Add check_generator.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 22 Apr, 2020 1 commit

Deyu Fu authored

- 23 Mar, 2020 1 commit

Kexin Yu authored

- 20 Mar, 2020 2 commits

- 11 Mar, 2020 1 commit

ptrblck authored
* Disable ninja for multihead_attn.
* Fix getCurrentStream in multihead_attn.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 02 Mar, 2020 1 commit

- 27 Feb, 2020 1 commit

mcarilli authored
* NHWC support for multi tensor apply.
* Compilation fix for version <= 1.4.
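
A minimal sketch of the `multi_tensor_applier` helper this commit extends: one fused kernel launch over a list of tensors instead of one launch per tensor, with the NHWC change meaning `channels_last` tensors may appear in those lists. The `amp_C.multi_tensor_scale` op used below is the stock apex one; the tensors themselves are placeholders.

```python
import torch
import amp_C
from apex.multi_tensor_apply import multi_tensor_applier

# Flag buffer the fused kernel sets if it encounters inf/nan.
overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")

srcs = [torch.randn(64, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last),
        torch.randn(1000, device="cuda")]
dsts = [torch.empty_like(t) for t in srcs]

# dst = src * 0.5 across the whole tensor list in a single fused launch.
multi_tensor_applier(amp_C.multi_tensor_scale, overflow_buf, [srcs, dsts], 0.5)
```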
- 25 Feb, 2020 2 commits

- 24 Feb, 2020 1 commit

Kevin Stephano authored
* Adding the C++ Multihead Attention implementation to contrib.
* Add a reference test that at least works for the forward pass.
* Remove CublasLt support from multihead attention.
* Add a new Python version of self attention.
* Update the Python model of MHA with a backward pass.
* Fixed the Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add the Encdec Python version of multihead attention. Clean up files.
* Tests for self and encdec multihead attention.
* Add a reference PyTorch implementation of attention with norm and add.
* Add the cutlass branch definition.
* Add the cutlass download to the compile.
* Add norm/add tests.
* Add biases to the PyTorch Python versions.
* Add tests and fix issues with the Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix the matmul1 output tensor size. Fix tests that missed the issue.
* Allow Z dimensions of 64K and greater on batched GEMMs.
* Remove redundant imports.
* General cleanup; remove deprecated or unused functions.
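
A rough usage sketch of the contrib multihead attention modules this series adds. `SelfMultiheadAttn` is assumed to mirror `torch.nn.MultiheadAttention`'s `(seq, batch, embed)` layout and call convention; constructor options beyond `embed_dim`, `num_heads`, and `dropout` are not shown because their exact names would be assumptions.

```python
import torch
from apex.contrib.multihead_attn import SelfMultiheadAttn  # assumed class name

attn = SelfMultiheadAttn(embed_dim=1024, num_heads=16, dropout=0.1).cuda().half()

q = torch.randn(128, 8, 1024, device="cuda", dtype=torch.half)  # (seq, batch, embed)
out, _ = attn(q, q, q)   # self attention; call/return convention assumed to follow nn.MultiheadAttention
out.float().sum().backward()
```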
- 15 Feb, 2020 1 commit

Deyu Fu authored

- 06 Feb, 2020 1 commit

Kevin Stephano authored
* Adding the C++ Multihead Attention implementation to contrib.
* Add a reference test that at least works for the forward pass.
* Remove CublasLt support from multihead attention.
* Add a new Python version of self attention.
* Update the Python model of MHA with a backward pass.
* Fixed the Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add the Encdec Python version of multihead attention. Clean up files.
* Tests for self and encdec multihead attention.
* Add a reference PyTorch implementation of attention with norm and add.
* Add the cutlass branch definition.
* Add the cutlass download to the compile.
* Add norm/add tests.
* Add biases to the PyTorch Python versions.
* Add tests and fix issues with the Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add f...