- 23 Feb, 2021 1 commit
-
-
yjk21 authored
-
- 01 Dec, 2020 1 commit
-
-
Kexin Yu authored
DistributedFusedAdam Model Parallelism Support (Megatron)
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
-
- 10 Aug, 2020 1 commit
-
-
ptrblck authored
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
- 01 Aug, 2020 1 commit
-
-
ptrblck authored
-
- 01 Jun, 2020 1 commit
-
-
mcarilli authored
Co-authored-by: Michael Carilli <mcarilli@nvidia.com>
-
- 30 May, 2020 2 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 29 May, 2020 1 commit
-
-
Burc Eryilmaz authored
Fuses dropout and softmax in backward pass, add bias support to CPP MHA, add additive mask support, separate Q/K/V parameters (#854)
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
-
- 14 May, 2020 1 commit
-
-
Andrew Tulloch authored
-
- 23 Apr, 2020 1 commit
-
-
ptrblck authored
* add CUDAGenerator guard
* fix generator_flag
* add guards for gen pointer/ref issue
* change mutex_ to mutex()
* add check_generator
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
- 22 Apr, 2020 1 commit
-
-
Deyu Fu authored
-
- 23 Mar, 2020 1 commit
-
-
Kexin Yu authored
-
- 20 Mar, 2020 2 commits
- 11 Mar, 2020 1 commit
-
-
ptrblck authored
* disable ninja for multihead_attn
* fix getCurrentStream in multihead_attn
Co-authored-by: pbialecki <pbialecki@nvidia.com>
-
- 02 Mar, 2020 1 commit
-
- 27 Feb, 2020 1 commit
-
-
mcarilli authored
* NHWC support for multi tensor apply
* compilation fix for version<=1.4
-
- 25 Feb, 2020 2 commits
- 24 Feb, 2020 1 commit
-
-
Kevin Stephano authored
* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention. Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference pytorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to pytorch python versions.
* Add tests and fix issues with python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix matmul1 output tensor size. Fix tests that missed issue.
* Allow for Z dimensions of 64K and greater on batched GEMMs.
* remove redundant imports
* general cleanup, remove deprecated or unused functions
-
- 15 Feb, 2020 1 commit
-
-
Deyu Fu authored
-
- 06 Feb, 2020 1 commit
-
-
Kevin Stephano authored
* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention. Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference pytorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to pytorch python versions.
* Add tests and fix issues with python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add f...
-
- 21 Jan, 2020 1 commit
-
-
jjsjann123 authored
-
- 08 Jan, 2020 1 commit
-
-
ptrblck authored
* add WAR for pip>=19.3.1
* remove pipmain, use extras_require instead
-
- 04 Oct, 2019 1 commit
-
-
Deyu Fu authored
* move previous fused_adam and fp16_optimizer to contrib
* make build contrib.fused_adam optional
* change build option name
* remove unnecessary try import
-
- 13 Sep, 2019 1 commit
-
-
mcarilli authored
-
- 06 Sep, 2019 1 commit
-
-
mcarilli authored
* Pushing for build tests
* Contrib files
* Removing deprecated checks
-
- 17 Aug, 2019 1 commit
-
-
Deyu Fu authored
-
- 16 Aug, 2019 1 commit
-
-
Deyu Fu authored
-
- 13 Aug, 2019 1 commit
-
-
Marek Kolodziej authored
Co-authored-by: Aditya Agrawal <aditya.iitb@gmail.com>
Co-authored-by: Marek Kolodziej <mkolod@gmail.com>
-
- 08 Aug, 2019 1 commit
-
-
Deyu Fu authored
-
- 31 May, 2019 1 commit
-
-
Thor Johnsen authored
* First draft, for discussion
* Fix mistakes in LAMB equations
* Add loop over chunk
* Bug fix
* Bug fix
* Bug fix
* Undo bug fix
* Bug fix
* Add multi tensor LAMB optimizer to setup.py
* Rename step_size to learning_rate
* Fix compilation errors
-
- 23 May, 2019 1 commit
-
-
Michael Carilli authored
-
- 22 May, 2019 1 commit
-
-
mcarilli authored
-
- 09 May, 2019 1 commit
-
-
Wil Kong authored
* Add softmax cross entropy loss with label smoothing support.
* Fix deprecation of AT_DISPATCH_XXX and several minor issues.
* Fix issues commented by reviewers.
* Add FB license.
* Remove code generation constraints.
* Add a simple unittest for label smoothing.
-
- 27 Apr, 2019 1 commit
-
-
jjsjann123 authored
* Persistent group batchnorm added
  Added persistent grouped batch norm for performance run on strong scaling case. Currently only supporting:
  1. nhwc layout
  2. fp16
  3. synchronization only within a node!
  Environment variable is used to tune LAUNCH_MARGIN that limits the CTAs usage by the persistent kernel.
  Documentation and examples will follow.
* updating type().scalarType() to scalar_type()
* moving launch margin to be defined at layer creation, adding a knob cap max ctas per sm
* fixing the cta computation
* review comment:
  set device_id through cudaGetDevice()
  move cudaMemset to cudaMemsetAsync
  updated __threadfence() to __threadfence_system() inter device write
-
- 18 Apr, 2019 1 commit
-
-
Michael Carilli authored
-
- 09 Apr, 2019 1 commit
-
-
Michael Carilli authored
-
- 23 Mar, 2019 1 commit
-
-
Cubbee authored
-
- 22 Mar, 2019 1 commit
-
-
mcarilli authored
* Adding Torch + bare-metal nvcc version check and container build tests
* Putting a canary in the coalmine
* canary proved elusive
* Trying direct setup.py install
* this should work
* Removing canary
* hopefully this works
-