- 28 May, 2020 1 commit
Max V. Irgiznov authored
- 27 May, 2020 1 commit
Kevin Stephano authored
Update Softmax in multihead attention to use the current CUDA stream instead of the default CUDA stream. (#843)
* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update Python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention. Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference PyTorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to PyTorch Python versions.
* Add tests and fix issues with Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix matmul1 output tensor size. Fix tests that missed issue.
* Allow for Z dimensions of 64K and greater on batched GEMMs.
* Remove redundant imports.
* General cleanup, remove deprecated or unused functions.
* Update Multihead Attention's softmax to use the current stream instead of the default stream.
* Fix setup.py that got messed up in merge with upstream.
* Update Multihead Attention strided batched gemms to use the current stream instead of the default.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
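The stream change above matters because an extension kernel hard-wired to the default stream ignores whatever stream the caller has made current. A minimal sketch of that user-visible contract, using plain PyTorch ops as a stand-in for the extension's softmax kernel (not apex's code):

```python
import torch

# Work queued inside a torch.cuda.stream() context runs on that "current"
# stream; a kernel launched on the default stream would run out of order
# with respect to it.
x = torch.randn(8, 16, 512, 512, device="cuda")
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    assert torch.cuda.current_stream().cuda_stream == side_stream.cuda_stream
    y = torch.softmax(x, dim=-1)   # launched on side_stream
side_stream.synchronize()          # wait for the softmax before consuming y
```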
- 22 May, 2020 2 commits
Thor Johnsen authored
Bug fix
Thor Johnsen authored
- 19 May, 2020 1 commit
Kexin Yu authored
Use global gradient clipping in FusedLAMB & add option for using NVLAMB
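A hedged usage sketch of the options this commit describes; the constructor argument names (`max_grad_norm`, `use_nvlamb`) are assumed from the commit description rather than checked against the exact signature:

```python
import torch
from apex.optimizers import FusedLAMB

model = torch.nn.Linear(1024, 1024).cuda()
# max_grad_norm: clip by the global (all-parameter) gradient norm before the update.
# use_nvlamb: switch to the NVLAMB variant of the update rule.
optimizer = FusedLAMB(model.parameters(), lr=2e-3,
                      max_grad_norm=1.0, use_nvlamb=True)

loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```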
- 14 May, 2020 1 commit
Andrew Tulloch authored
- 13 May, 2020 1 commit
Andrew Sears authored
Signed-off-by: asears <asears@users.noreply.github.com>
- 12 May, 2020 2 commits
Thor Johnsen authored
Reversible fused Adam with multi-tensor (mt) support
Thor Johnsen authored
- 08 May, 2020 1 commit
Thor Johnsen authored
- 07 May, 2020 2 commits
Thor Johnsen authored
Thor Johnsen authored
- 06 May, 2020 3 commits
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
- 05 May, 2020 1 commit
Thor Johnsen authored
- 04 May, 2020 1 commit
Thor Johnsen authored
- 02 May, 2020 3 commits
- 01 May, 2020 4 commits
Kexin Yu authored
https://github.com/NVIDIA/apex
Kexin Yu authored
Deyu Fu authored
* Changes to make xentropy softmax load/store vectorized when possible:
* Increase default ILP so that each thread handles 16 bytes of data in one step.
* Make each thread load/store the longest vector possible.
* Make the unroll case handle adjacent data instead of strided, so the ordering matches the vectorized case.
* Add a shift for the non-aligned case; remove accesses aligned to less than 16 bytes.
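A small illustration of the general pattern (plain PyTorch, an assumption rather than apex's exact check): a 16-byte vectorized path is only usable when the base pointer and the per-row byte count are both 16-byte aligned, which is why a shifted/scalar fallback is needed.

```python
import torch

logits = torch.randn(128, 30522, device="cuda", dtype=torch.half)
row_bytes = logits.size(1) * logits.element_size()   # bytes per row
base_ok = logits.data_ptr() % 16 == 0                # base pointer alignment
rows_ok = row_bytes % 16 == 0                        # every row start stays aligned
print(base_ok, rows_ok)  # 16-byte vector loads need both; otherwise fall back
```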
Kexin Yu authored
- 30 Apr, 2020 6 commits
Kexin Yu authored
Deyu Fu authored
* Modify MTA axpby for wider load/store.
* Make the scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads.
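The Python-facing entry point for these kernels is `multi_tensor_applier`; a hedged usage sketch of that interface (the load-width change itself is internal to the CUDA kernels):

```python
import torch
import amp_C
from apex.multi_tensor_apply import multi_tensor_applier

overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")
grads = [torch.randn(n, device="cuda") for n in (1024, 4096, 33)]
scaled = [torch.empty_like(g) for g in grads]

# One fused launch scales every tensor in the list: scaled[i] = grads[i] * 0.5
multi_tensor_applier(amp_C.multi_tensor_scale, overflow_buf, [grads, scaled], 0.5)
```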
Deyu Fu authored
* Update fused bias relu backward kernel.
* Add support for not requiring the first layer dgrad.
* Fix bug: wrong layer in requires_grad check.
* Add infrastructure for optional bias and activation; currently only supports no bias and no relu.
* Make bias and relu optional separately.
* Add sigmoid activation option.
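A reference sketch in plain PyTorch of the per-layer computation these options describe, with bias and activation each optional and sigmoid as an alternative to ReLU; this only illustrates the semantics and is not apex's fused API:

```python
import torch

def mlp_layer(x, weight, bias=None, activation=None):
    # The fused kernel covers GEMM + optional bias + optional activation.
    y = x @ weight.t()
    if bias is not None:
        y = y + bias
    if activation == "relu":
        y = torch.relu(y)
    elif activation == "sigmoid":
        y = torch.sigmoid(y)
    return y

x = torch.randn(32, 512, device="cuda", requires_grad=True)
w = torch.randn(1024, 512, device="cuda", requires_grad=True)
out = mlp_layer(x, w, bias=None, activation="sigmoid")
out.sum().backward()   # autograd provides the reference backward (dgrad/wgrad)
```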
Burc Eryilmaz authored
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Thor Johnsen authored
Thor Johnsen authored
- 29 Apr, 2020 5 commits
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
- 28 Apr, 2020 1 commit
Kexin Yu authored
- 23 Apr, 2020 1 commit
ptrblck authored
* Add CUDAGenerator guard.
* Fix generator_flag.
* Add guards for the generator pointer/reference issue.
* Change mutex_ to mutex().
* Add check_generator.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
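These guards concern the C++ extension's handling of the CUDA RNG generator; the user-visible contract they protect is reproducibility under torch's generator state. A small stand-in sketch using a stock PyTorch op (an illustration of that contract, not the extension code):

```python
import torch

x = torch.ones(4, 8, device="cuda")
torch.cuda.manual_seed(1234)
a = torch.nn.functional.dropout(x, p=0.5, training=True)
torch.cuda.manual_seed(1234)
b = torch.nn.functional.dropout(x, p=0.5, training=True)
assert torch.equal(a, b)   # same generator state, same random mask
```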
- 22 Apr, 2020 2 commits
Deyu Fu authored
Vinicius Reis authored
The LARC optimizer wraps an underlying optimizer and then needs to be passed to amp.initialize for mixed precision. There were three different crashes in this situation; this fixes all of them and adds a unit test. I don't know if the 'LARC' in sys.modules check ever worked; in my setup the entry in sys.modules is 'apex.parallel.LARC'. Checking whether the variable is defined seems more reliable.
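A minimal sketch of the wrapping pattern this commit fixes (standard apex usage; the opt_level and hyperparameters here are arbitrary):

```python
import torch
from apex import amp
from apex.parallel.LARC import LARC

model = torch.nn.Linear(512, 512).cuda()
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer = LARC(base_optimizer)            # LARC wraps the inner optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(16, 512, device="cuda")).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```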
- 20 Apr, 2020 1 commit
Kexin Yu authored
add additional loop for lists of params in FP16_Optimizer's load_state_dict
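A hedged sketch of the restore pattern the commit describes; the group/list names are placeholders for illustration, not apex's exact attribute names:

```python
def restore_master_params(current_groups, saved_groups):
    # Each entry is a list of fp32 master parameter tensors, one list per
    # parameter group; the extra loop walks the lists group by group.
    for current_list, saved_list in zip(current_groups, saved_groups):
        for current_p, saved_p in zip(current_list, saved_list):
            current_p.data.copy_(saved_p.data)
```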