- 31 Aug, 2021 3 commits
-
-
Thor Johnsen authored
Spatially Distributed Fast Bottleneck block
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 20 Aug, 2021 1 commit
-
-
X Wang authored
-
- 17 Jul, 2021 2 commits
-
-
Nan Zheng authored
* Added support for fused ReLU and dropout in the transducer joint
* Reorganized code selection path in the transducer joint fwd
* Added support for fused ReLU+dropout in the transducer joint
* Vectorize transducer loss backward with fused softmax (#3)
* Nanz/transducer loss (#4)
* Vectorize transducer loss backward with fused softmax
* Added a predicate to avoid a potential IMA
* Nanz/transducer loss (#5)
* Vectorize transducer loss backward with fused softmax
* Added a predicate to avoid a potential IMA
* Added more predicates to avoid IMAs
* Updated documentation for newly added features
* Fixed an error in transducer.py
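For orientation, the unfused computation these transducer-joint commits fuse is roughly a broadcast add of encoder and predictor activations followed by ReLU and dropout. A minimal reference sketch follows; tensor names, shapes, and the function name are illustrative, not the extension's actual API:

```python
# Reference (unfused) sketch of the transducer joint with the ReLU + dropout
# that the commits above fuse into the extension; names/shapes are illustrative.
import torch
import torch.nn.functional as F

def transducer_joint_ref(f, g, dropout_p=0.0, training=True):
    # f: encoder activations   [B, T, H]
    # g: predictor activations [B, U, H]
    joint = f.unsqueeze(2) + g.unsqueeze(1)          # broadcast add -> [B, T, U, H]
    joint = F.relu(joint)                            # fused ReLU
    return F.dropout(joint, p=dropout_p, training=training)  # fused dropout
```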
-
yjk21 authored
-
- 16 Jul, 2021 1 commit
-
-
X Wang authored
* local_rank and install CUDA version fix
-
- 15 Jun, 2021 2 commits
- 26 May, 2021 1 commit
-
-
Kexin Yu authored
* Clip before reduce-scatter
* Provide option to clip before/after reduce-scatter
* Change to clip after all-reduce (avoid confusion)
* Fix comments
-
- 17 May, 2021 1 commit
-
-
Burc Eryilmaz authored
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 19 Apr, 2021 1 commit
-
-
Burc Eryilmaz authored
* don't create cublasLt handle, fix zero block size case * cleanup
-
- 17 Apr, 2021 3 commits
-
-
Burc Eryilmaz authored
* Initial cublasLt support
* 64-bit input
* Add license headers
* Cleanup
* Remove license
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
ptrblck authored
-
Deyu Fu authored
* Initial commit for adding fast bottleneck
* Sync cudnn-frontend module
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
- 16 Apr, 2021 1 commit
-
-
yjk21 authored
-
- 15 Apr, 2021 3 commits
-
-
Jay Rodge authored
Fixed a typo
-
Kexin Yu authored
* Enable no_copy
* Barrier for SHARP
* Set verbose=False by default
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
-
Sudhakar Singh authored
* Add unit tests for fused NovoGrad
* Fix: tensors should reside on the same device
* Fix: the CUDA stream should be obtained on the same device the tensors reside on; found this while debugging the fused NovoGrad multi-device unit test
* Fixed issues mentioned in the comments
-
- 24 Mar, 2021 2 commits
-
-
Kexin Yu authored
* Sync-free Distributed LAMB
* Init lr with provided value
* Wait on the L2 norm stream
* Reorder params
* Fix indent
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
-
Nan Zheng authored
* Initial check-in of the transducer extension
* Added more comments to help explain the code
* Corrected minor typos
* Renamed variables in tests to match the extension; disabled the ninja build option
-
- 23 Feb, 2021 1 commit
-
-
yjk21 authored
-
- 10 Feb, 2021 1 commit
-
-
Shoufa Chen authored
* Copy-paste friendly
* Fix import of container_abcs: nightly PyTorch has removed `container_abcs` from `torch._six` (https://github.com/pytorch/pytorch/commit/58eb23378f2a376565a66ac32c93a316c45b6131#diff-b3c160475f0fbe8ad50310f92d3534172ba98203387a962b7dc8f4a23b15cf4dL35)
* Keep the existing import for PyTorch 1.7 and earlier
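The guarded import this commit describes follows the usual pattern of branching on the PyTorch version. A minimal sketch; the exact placement and version check inside apex may differ:

```python
# Minimal sketch of the version-guarded import described above; the exact
# placement inside apex may differ.
import torch

TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])

if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
    # PyTorch 1.7 and earlier still expose container_abcs via torch._six
    from torch._six import container_abcs
else:
    # Newer / nightly PyTorch removed it; use the stdlib module instead
    import collections.abc as container_abcs
```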
-
- 20 Jan, 2021 1 commit
-
-
Burc Eryilmaz authored
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 17 Dec, 2020 2 commits
-
-
Thor Johnsen authored
Update ASP README to highlight default recipe
-
jpool-nv authored
The recipe was presented after some non-standard API calls, so this moves the suggested usage up, gives it its own section, and reinforces the suggested usage in the non-standard section.
-
- 04 Dec, 2020 3 commits
-
-
Stas Bekman authored
-
Kexin Yu authored
* Add flag for DistributedAdam: step_support_amp_scaling
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
-
Burc Eryilmaz authored
* fuse dropout into softmax in fprop for additive mask case
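For reference, the unfused computation the fused kernel replaces in the additive-mask case is a softmax over the scores plus mask, followed by dropout. A minimal sketch; the function and tensor names are illustrative:

```python
# Unfused reference of what the commit above fuses in fprop for the
# additive-mask case; names are illustrative.
import torch.nn.functional as F

def softmax_dropout_ref(scores, additive_mask, p, training=True):
    probs = F.softmax(scores + additive_mask, dim=-1)  # mask added before softmax
    return F.dropout(probs, p=p, training=training)    # dropout applied to the probabilities
```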
-
- 02 Dec, 2020 1 commit
-
-
Janusz Lisiecki authored
- resume() is a nested function, and when it loads best_prec1 it creates a local variable that hides the one from the parent function (which refers to the global one). This PR adds `global` so the global variable is modified as intended.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
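The underlying Python rule is that assignment inside a function binds a new local name unless the name is declared `global`. A minimal sketch of the fix, with checkpoint loading stubbed out:

```python
# Minimal sketch of the fix: without `global`, the assignment inside resume()
# would bind a new local best_prec1 instead of updating the module-level one.
# Checkpoint loading is stubbed out for illustration.
best_prec1 = 0

def main():
    def resume(checkpoint):
        global best_prec1          # the fix: write to the module-level variable
        best_prec1 = checkpoint['best_prec1']

    resume({'best_prec1': 76.3})
    print(best_prec1)              # 76.3 with `global`; stays 0 without it

main()
```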
-
- 01 Dec, 2020 1 commit
-
-
Kexin Yu authored
DistributedFusedAdam Model Parallelism Support (Megatron)
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
-
- 19 Oct, 2020 1 commit
-
-
lly-zero-one authored
This PR mainly optimizes the performance of SyncBatchNorm and also fixes one potential issue in the welford_parallel kernel implementation. For the performance improvement, we batch the mean/var/count all_gather communication together and send it once in the forward path. We also batch the all_reduce in the backward path. We add a contiguous call on the input of the welford_parallel kernel. If there is any standard perf benchmark, I would be happy to run it.
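A minimal sketch of the batching idea, assuming per-channel mean/var tensors and a scalar count; this is illustrative, not the extension's actual interface:

```python
# Illustrative sketch (not apex's actual SyncBatchNorm code): pack the per-rank
# mean/var/count statistics into one contiguous tensor and gather them with a
# single all_gather instead of three separate calls.
import torch
import torch.distributed as dist

def gather_channel_stats(mean, var, count):
    # mean, var: [C]; count: scalar tensor on the same device
    packed = torch.cat([mean, var, count.reshape(1)]).contiguous()
    gathered = [torch.empty_like(packed) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, packed)          # one communication call for all three stats
    num_channels = mean.numel()
    means = [t[:num_channels] for t in gathered]
    vars_ = [t[num_channels:2 * num_channels] for t in gathered]
    counts = [t[-1] for t in gathered]
    return means, vars_, counts
```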
-
- 29 Sep, 2020 1 commit
-
-
ptrblck authored
-
- 15 Sep, 2020 1 commit
-
-
Thor Johnsen authored
Update ASP README
-
- 14 Sep, 2020 2 commits
- 15 Aug, 2020 1 commit
-
-
mcarilli authored
-
- 10 Aug, 2020 1 commit
-
-
ptrblck authored
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
- 06 Aug, 2020 1 commit
-
-
ngimel authored
-
- 05 Aug, 2020 1 commit
-
-
ngimel authored
* Add device guards to the optimizers
* Add untracked file
* Set deviceGuard in multi_tensor_apply
* Address review comments; fix lamb
* Indent
* Typo
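The guard referred to here is the standard pattern of switching the current CUDA device to the device that owns each parameter before launching work on it. A minimal sketch, with a plain SGD-style update standing in for the fused kernels:

```python
# Illustrative sketch of per-parameter device guards (a plain update stands in
# for apex's fused optimizer kernels).
import torch

def guarded_step(params, lr=1e-3):
    for p in params:
        if p.grad is None or not p.is_cuda:
            continue
        with torch.cuda.device(p.device):   # launch on the device that owns p
            p.data.add_(p.grad, alpha=-lr)
```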
-