"vscode:/vscode.git/clone" did not exist on "abb8dd57f8a86a71b5f8fe1f059aee3636a658b1"
- 02 May, 2020 1 commit
-
-
Kexin Yu authored
-
- 01 May, 2020 4 commits
-
-
Kexin Yu authored
https://github.com/NVIDIA/apex
-
Kexin Yu authored
-
Deyu Fu authored
* Changes to make xentropysoftmax loads/stores vectorized when possible:
  * Increase the default ILP so that each thread handles 16 bytes of data per step
  * Make each thread load/store the longest vector possible
  * Make the unroll case handle adjacent data instead of strided data, so the ordering matches the vectorized case
* Add a shift for the not-aligned case. Remove aligned accesses of less than 16 bytes
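For illustration, the access pattern described above (16 bytes per thread per step, the longest vector the alignment allows, and a scalar "shift" prologue for a misaligned head) can be sketched as a standalone copy kernel. This is a hedged sketch, not the apex xentropy kernel; the name `vectorized_copy` and the plain copy semantics are invented for clarity.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Minimal sketch of the access pattern above (not the apex kernel): each
// thread moves 16 bytes per step via float4, with a scalar prologue ("shift")
// for a misaligned head and a scalar tail. Assumes `in` and `out` share the
// same alignment offset modulo 16 bytes.
__global__ void vectorized_copy(const float* __restrict__ in,
                                float* __restrict__ out, int n) {
  // Number of leading scalar elements needed to reach a 16-byte boundary.
  int shift = static_cast<int>(
      ((16 - (reinterpret_cast<uintptr_t>(in) & 15)) & 15) / sizeof(float));
  if (shift > n) shift = n;

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  // Scalar prologue: elements before the first 16-byte boundary.
  for (int i = tid; i < shift; i += stride) out[i] = in[i];

  // Vectorized body: adjacent float4 elements, 16 bytes per load/store.
  int n_vec = (n - shift) / 4;
  const float4* in4 = reinterpret_cast<const float4*>(in + shift);
  float4* out4 = reinterpret_cast<float4*>(out + shift);
  for (int i = tid; i < n_vec; i += stride) out4[i] = in4[i];

  // Scalar epilogue: tail that does not fill a full float4.
  for (int i = shift + 4 * n_vec + tid; i < n; i += stride) out[i] = in[i];
}
```

Handling the head and tail with scalar accesses keeps the vectorized body free of per-element alignment checks, which is the point of processing adjacent rather than strided elements.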
-
Kexin Yu authored
-
- 30 Apr, 2020 4 commits
-
-
Kexin Yu authored
-
Deyu Fu authored
* Modify MTA axpby for wider load/store
* Make the scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads
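A hedged sketch of what a "wider load" means for an elementwise op such as axpby; this is not apex's multi_tensor_apply machinery, and it assumes 16-byte-aligned inputs whose length is a multiple of 4.

```cuda
#include <cuda_runtime.h>

// Minimal sketch of axpby (out = a*x + b*y) with 16-byte float4 accesses.
// The real multi_tensor kernels also handle tails, misalignment, and chunked
// tensor lists; this only shows the wider load/store itself.
__global__ void axpby_vec4(float a, const float4* __restrict__ x,
                           float b, const float4* __restrict__ y,
                           float4* __restrict__ out, int n_vec) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n_vec) {
    float4 xv = x[i];  // one 16-byte load instead of four 4-byte loads
    float4 yv = y[i];
    float4 r;
    r.x = a * xv.x + b * yv.x;
    r.y = a * xv.y + b * yv.y;
    r.z = a * xv.z + b * yv.z;
    r.w = a * xv.w + b * yv.w;
    out[i] = r;        // one 16-byte store
  }
}
```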
-
Deyu Fu authored
* Update the fused bias-relu backward kernel
* Add support for not requiring the first layer's dgrad
* Fix bug: wrong layer used in the requires-grad check
* Add infrastructure for optional bias and activation; currently only "no bias and no relu" is supported
* Make bias and relu optional separately
* Add a sigmoid activation option
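One common way to make bias and the activation independently optional in a fused kernel is compile-time dispatch on template parameters. The sketch below is illustrative only; it is not apex's MLP/fused-dense code, and the names `fused_bias_act` and `Activation` are invented.

```cuda
#include <cuda_runtime.h>

enum class Activation { None, Relu, Sigmoid };

// Optional bias and optional activation selected at compile time, so the
// generated code contains no runtime branches for the disabled features.
template <bool HasBias, Activation Act>
__global__ void fused_bias_act(const float* __restrict__ in,
                               const float* __restrict__ bias,
                               float* __restrict__ out, int rows, int cols) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= rows * cols) return;
  float v = in[idx];
  if (HasBias) v += bias[idx % cols];                   // optional bias add
  if (Act == Activation::Relu)        v = v > 0.f ? v : 0.f;
  else if (Act == Activation::Sigmoid) v = 1.f / (1.f + expf(-v));
  out[idx] = v;
}

// Host-side dispatch picks the instantiation, e.g.:
//   fused_bias_act<true, Activation::Sigmoid><<<grid, block>>>(...);
```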
-
Burc Eryilmaz authored
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 28 Apr, 2020 1 commit
-
-
Kexin Yu authored
-
- 23 Apr, 2020 1 commit
-
-
ptrblck authored
* Add CUDAGenerator guard
* Fix generator_flag
* Add guards for the gen pointer/ref issue
* Change mutex_ to mutex()
* Add check_generator
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
- 22 Apr, 2020 2 commits
-
-
Deyu Fu authored
-
Vinicius Reis authored
The LARC optimizer wraps an underlying optimizer and then needs to be passed to amp.initialize for mixed precision. There were three different crashes happening in this situation; fix all of them and add a unit test. I don't know if the 'LARC' in sys.modules check ever worked; in my setup, the entry in sys.modules is 'apex.parallel.LARC'. Checking whether the variable is defined seems more reliable, though.
-
- 20 Apr, 2020 2 commits
- 13 Apr, 2020 1 commit
-
-
Mannat Singh authored
-
- 05 Apr, 2020 2 commits
- 03 Apr, 2020 4 commits
- 02 Apr, 2020 1 commit
-
-
Kexin Yu authored
-
- 01 Apr, 2020 2 commits
- 31 Mar, 2020 2 commits
-
-
Kexin Yu authored
-
Jeff Bowles authored
-
- 25 Mar, 2020 1 commit
-
-
msbaines authored
The CUDA kernel used by fused-adam was using the default stream on the default device. The kernel needs to use the same device as the parameter tensor. Fixed by using a context manager to set the correct default device. For the use_mt case, an error is raised instead. Alternatively, the use_mt case could launch one kernel per CUDA device. The non-contrib version will also need to be fixed.
Co-authored-by: Mandeep Singh Baines <msb@fb.com>
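The commit describes fixing this from the wrapper side with a context manager that sets the correct default device. For comparison, the sketch below shows the analogous guarantee expressed inside a C++/CUDA extension with a device guard; the kernel and launcher are hypothetical and do not reflect the apex contrib implementation.

```cuda
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical kernel (not apex's FusedAdam): the update is a placeholder,
// the point is the device guard in the launcher below.
__global__ void fused_adam_step(float* p, const float* g, float lr, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] -= lr * g[i];  // placeholder update, not the real Adam math
}

void fused_adam_launch(at::Tensor param, at::Tensor grad, float lr) {
  // Make the parameter's device current for this scope, so the kernel does
  // not silently run on the default device.
  const c10::cuda::CUDAGuard device_guard(param.device());
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();  // that device's stream
  int n = static_cast<int>(param.numel());
  int block = 256, grid = (n + block - 1) / block;
  fused_adam_step<<<grid, block, 0, stream>>>(
      param.data_ptr<float>(), grad.data_ptr<float>(), lr, n);
}
```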
-
- 23 Mar, 2020 2 commits
- 21 Mar, 2020 2 commits
- 20 Mar, 2020 3 commits
- 17 Mar, 2020 2 commits
- 11 Mar, 2020 2 commits
-
-
ptrblck authored
* Disable ninja for multihead_attn
* Fix getCurrentStream in multihead_attn
Co-authored-by: pbialecki <pbialecki@nvidia.com>
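The getCurrentStream fix presumably comes down to launching work on the stream PyTorch currently has active rather than the legacy default stream. A minimal sketch with an invented kernel, not the multihead_attn source:

```cuda
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>

// Invented example kernel; the relevant change is the launch argument below.
__global__ void scale_kernel(float* x, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
}

void scale_on_current_stream(at::Tensor t, float s) {
  // Launch on the stream PyTorch currently has active for this device, so the
  // kernel is ordered correctly with surrounding ATen operations.
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  int n = static_cast<int>(t.numel());
  scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(t.data_ptr<float>(), s, n);
}
```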
-
Tomasz Grel authored
* Do not unscale the gradients if the loss scale is equal to 1
* Skip unscaling for loss scale == 1 only when static loss scaling is used
-
- 02 Mar, 2020 1 commit
-