- 30 May, 2020 4 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
- 29 May, 2020 3 commits
  - Kevin Stephano authored
  - Burc Eryilmaz authored
    Fuses dropout and softmax in the backward pass, adds bias support to the C++ MHA, adds additive-mask support, and separates the Q/K/V parameters (#854). Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
  - Kexin Yu authored
    Make FusedLAMB async
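The fused dropout-softmax backward commit above combines two chain-rule steps (undoing dropout scaling, then applying the softmax Jacobian) into a single pass over each row. A minimal pure-Python sketch of the underlying math — not the actual CUDA kernel, and all names here are illustrative:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fused_dropout_softmax_backward(grad_out, probs, mask, keep_prob):
    """One pass per row: backward through dropout using the saved mask,
    then the softmax Jacobian dx_i = p_i * (g_i - sum_j g_j * p_j)."""
    # Dropout backward: gradient flows only where the mask kept the value,
    # rescaled by 1/keep_prob to match inverted-dropout forward scaling.
    g = [go * m / keep_prob for go, m in zip(grad_out, mask)]
    # Softmax backward, using the saved probabilities from the forward pass.
    dot = sum(gi * pi for gi, pi in zip(g, probs))
    return [pi * (gi - dot) for pi, gi in zip(probs, g)]
```

Because the softmax Jacobian is applied to a gradient that sums against probabilities summing to one, the resulting row gradient always sums to zero — a quick sanity check for any fused implementation.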
- 28 May, 2020 1 commit
  - Max V. Irgiznov authored
- 27 May, 2020 1 commit
  - Kevin Stephano authored
    Update softmax in multihead attention to use the current CUDA stream instead of the default CUDA stream (#843)
    * Adding C++ Multihead Attention implementation to contrib.
    * Add reference test that at least works for forward.
    * Remove CublasLt support from multihead attention.
    * Add new Python version of self attention.
    * Update Python model of MHA with backward pass.
    * Fixed output linear connection in MHA.
    * Clean up compiles and add documentation to PySelfAttention.
    * Add encdec Python version of multihead attention. Clean up files.
    * Tests for self and encdec multihead attention.
    * Add reference PyTorch implementation of attention with norm and add.
    * Add cutlass branch definition.
    * Add cutlass download to compile.
    * Add norm/add tests.
    * Add biases to PyTorch Python versions.
    * Add tests and fix issues with Python version of attention masking.
    * Create README.md
    * Update README.md (repeated edits)
    * Update perf test parameters.
    * Add files via upload
    * Fix matmul1 output tensor size. Fix tests that missed the issue.
    * Allow for Z dimensions of 64K and greater on batched GEMMs.
    * Remove redundant imports.
    * General cleanup; remove deprecated or unused functions.
    * Update multihead attention's softmax to use the current stream instead of the default stream.
    * Fix setup.py that got messed up in merge with upstream.
    * Update multihead attention strided batched GEMMs to use the current stream instead of the default.
    Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 23 May, 2020 1 commit
  - Kexin Yu authored
- 22 May, 2020 7 commits
  - Kexin Yu authored
  - Kexin Yu authored
  - Thor Johnsen authored
    Bug fix
  - Thor Johnsen authored
  - Kexin Yu authored
  - Kexin Yu authored
  - Kexin Yu authored
- 21 May, 2020 1 commit
  - Kexin Yu authored
- 19 May, 2020 1 commit
  - Kexin Yu authored
    Use global gradient clipping in FusedLAMB and add an option for using NVLAMB
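Global gradient clipping, as referenced in the FusedLAMB commit above, rescales all gradients by a single factor derived from their combined L2 norm, rather than clipping each tensor independently. A minimal pure-Python sketch of the idea (illustrative only — not apex's multi-tensor implementation):

```python
import math

def clip_global_norm(grads, max_norm):
    """Scale every gradient list in-place by min(1, max_norm / global_norm),
    where global_norm is the L2 norm over ALL gradients combined.
    `grads` is a list of flat lists of floats, one per parameter tensor."""
    total_sq = sum(g * g for grad in grads for g in grad)
    global_norm = math.sqrt(total_sq)
    if global_norm > max_norm:
        # One shared scale factor preserves the relative direction of the
        # full gradient vector, unlike per-tensor clipping.
        scale = max_norm / global_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return global_norm
```

Using one shared scale keeps the overall update direction intact, which matters for layer-wise-adaptive optimizers like LAMB, where per-layer trust ratios are computed after clipping.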
- 14 May, 2020 1 commit
  - Andrew Tulloch authored
- 13 May, 2020 1 commit
  - Andrew Sears authored
    Signed-off-by: asears <asears@users.noreply.github.com>
- 12 May, 2020 2 commits
  - Thor Johnsen authored
    Reversible fused Adam with mt support
  - Thor Johnsen authored
- 08 May, 2020 1 commit
  - Thor Johnsen authored
- 07 May, 2020 2 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
- 06 May, 2020 3 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
- 05 May, 2020 1 commit
  - Thor Johnsen authored
- 04 May, 2020 1 commit
  - Thor Johnsen authored
- 02 May, 2020 3 commits
- 01 May, 2020 4 commits
  - Kexin Yu authored
  - Kexin Yu authored
  - Deyu Fu authored
    Changes to make xentropy softmax load/store vectorized when possible:
    * Increase the default ILP so that each thread handles 16 bytes of data in one step.
    * Make each thread load/store the longest vector possible.
    * Make the unroll case handle adjacent data instead of strided data, so the access order matches the vector case.
    * Add a shift for the unaligned case; remove accesses aligned to fewer than 16 bytes.
  - Kexin Yu authored
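The xentropy commit above changes the unrolled (non-vector) fallback path so that each thread touches adjacent elements, producing the same element order as the 16-byte vectorized path. A pure-Python sketch of the two indexing schemes (illustrative only; `ilp=4` mimics one `float4`-sized load per thread per step):

```python
def adjacent_indices(tid, n_threads, ilp, n):
    """New scheme: each thread reads a contiguous chunk of `ilp` elements
    per step, matching the element order of a vector load."""
    out = []
    step = n_threads * ilp  # elements covered by the whole thread block per step
    for base in range(tid * ilp, n, step):
        out.extend(range(base, min(base + ilp, n)))
    return out

def strided_indices(tid, n_threads, ilp, n):
    """Old scheme: `ilp` accesses per step, each separated by `n_threads`,
    so a thread's elements are scattered rather than contiguous."""
    out = []
    step = n_threads * ilp
    for base in range(tid, n, step):
        out.extend(i for i in (base + k * n_threads for k in range(ilp)) if i < n)
    return out
```

Both schemes partition the same range of elements across threads; only the per-thread access pattern differs, which is what lets the adjacent scheme be swapped for genuine vector loads when the pointer is 16-byte aligned.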
- 30 Apr, 2020 2 commits