Update softmax in multihead attention to use the current CUDA stream instead of the default CUDA stream. (#843)
* Add C++ multihead attention implementation to contrib.
* Add reference test that covers at least the forward pass.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update Python model of MHA with backward pass.
* Fix output linear connection in MHA.
* Clean up compilation and add documentation to PySelfAttention.
* Add encdec Python version of multihead attention. Clean up files.
* Add tests for self and encdec multihead attention.
* Add reference PyTorch implementation of attention with norm and add.
* Add CUTLASS branch definition.
* Add CUTLASS download to the compile step.
* Add norm/add tests.
* Add biases to the PyTorch Python versions.
* Add tests and fix issues with the Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix matmul1 output tensor size. Fix tests that missed the issue.
* Allow for Z dimensions of 64K and greater on batched GEMMs (see the batch-chunking sketch after this list).
* Remove redundant imports.
* General cleanup; remove deprecated or unused functions.
* Update multihead attention's softmax to use the current stream instead of the default stream.
* Fix setup.py that was broken in the merge with upstream.
* Update multihead attention's strided batched GEMMs to use the current stream instead of the default (see the stream sketch after this list).
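The stream commits above come down to running the extension's softmax kernel and its strided batched GEMMs on PyTorch's current CUDA stream instead of the legacy default stream, so the custom ops stay ordered with surrounding PyTorch work. A minimal sketch of that pattern, assuming a placeholder softmax kernel and ATen's stream/handle accessors (not the PR's actual code):

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <cublas_v2.h>

// Illustrative launcher: the kernel name below is a placeholder, not apex code.
void launch_attention_softmax(float* scores, int rows, int cols) {
  // Use the stream set by the caller (e.g. via torch.cuda.stream(...)),
  // not the legacy default stream 0.
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  dim3 grid(rows);
  dim3 block(256);
  // softmax_kernel<<<grid, block, 0, stream>>>(scores, cols);

  // The strided batched GEMMs should target the same stream through cuBLAS.
  cublasHandle_t handle = at::cuda::getCurrentCUDABlasHandle();
  cublasSetStream(handle, stream);
  // cublasSgemmStridedBatched(handle, /* ... GEMM arguments ... */);
}
```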
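For the 64K-and-greater batch commit, a hedged sketch of a common workaround, assuming the limit comes from mapping the batch index to gridDim.z (which CUDA caps at 65535); the kernel and launcher are illustrative stand-ins, not the PR's code:

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// CUDA limits gridDim.y and gridDim.z to 65535, so a kernel that carries the
// batch index on the z dimension must process batches of 64K+ in chunks.
constexpr int kMaxGridZ = 65535;

// Stand-in for any per-batch kernel in the attention pipeline.
__global__ void batched_scale_kernel(float* data, int rows, int cols, float s) {
  int b = blockIdx.z;  // batch index within the current chunk
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < rows && c < cols) {
    data[(size_t)b * rows * cols + (size_t)r * cols + c] *= s;
  }
}

void launch_in_chunks(float* data, int batches, int rows, int cols, float s) {
  dim3 block(32, 8);
  for (int start = 0; start < batches; start += kMaxGridZ) {
    int chunk = std::min(kMaxGridZ, batches - start);
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y,
              chunk);
    batched_scale_kernel<<<grid, block>>>(
        data + (size_t)start * rows * cols, rows, cols, s);
  }
}
```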
Co-authored-by: pbialecki <pbialecki@nvidia.com>