    Change to Multihead Attention to allow Batched GEMMs larger than 64K. (#728) · 1733946a
    Kevin Stephano authored
    * Adding C++ Multihead Attention implementation to contrib.
    
    * Add reference test that at least works for forward.
    
    * Remove CublasLt support from multihead attention.
    
    * Add new Python version of self-attention (a generic sketch of the computation appears after this list).
    
    * Update python model of MHA with backward pass.
    
    * Fixed Output Linear connection in MHA.
    
    * Clean up compiles and add documentation to PySelfAttention.
    
    * Add Encdec Python version of multihead attention. Clean up files.
    
    * Tests for self and encdec multihead attention.
    
    * Add reference pytorch implementation of attention with norm and add.
    
    * Add cutlass branch definition.
    
    * Add cutlass download to compile.
    
    * Add norm/add tests.
    
    * Add biases to pytorch python versions.
    
    * Add tests and fix issues with the Python version of attention masking.
    
    * Create README.md
    
    * Update README.md
    
    * Update README.md
    
    * Update perf test parameters.
    
    * Update README.md
    
    * Update README.md
    
    * Update README.md
    
    * Add files via upload
    
    * Update README.md
    
    * Update README.md
    
    * Update README.md
    
    * Fix matmul1 output tensor size. Fix tests that missed the issue.
    
    * Allow for Z dimensions of 64K and greater on batched GEMMs (see the chunked GEMM sketch after this list).
    
    * Remove redundant imports.
    
    * General cleanup; remove deprecated or unused functions.
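The Python reference models mentioned in the commits above compute standard multihead self-attention. The following is a minimal, self-contained PyTorch sketch of that computation with an optional boolean mask; the function and parameter names are hypothetical, and it is not the contrib module's actual code, which the commits also extend with biases and a norm/add path.

```python
import math

import torch
import torch.nn.functional as F


def self_attention_ref(qkv_proj, out_proj, x, mask=None, heads=8):
    """Sketch of multihead self-attention. x: (seq, batch, embed)."""
    seq, batch, embed = x.shape
    head_dim = embed // heads

    # Joint Q/K/V projection, then fold the heads into the batch dimension.
    q, k, v = qkv_proj(x).chunk(3, dim=-1)

    def split(t):
        return t.view(seq, batch * heads, head_dim).transpose(0, 1)

    q, k, v = split(q), split(k), split(v)

    # Scaled dot-product scores: (batch * heads, seq, seq).
    scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(head_dim)
    if mask is not None:
        # mask is a bool tensor broadcastable to the scores shape;
        # True marks positions to be ignored.
        scores = scores.masked_fill(mask, float("-inf"))
    probs = F.softmax(scores, dim=-1)

    # Weighted sum of values, merge heads, apply the output linear layer.
    ctx = torch.bmm(probs, v)  # (batch * heads, seq, head_dim)
    ctx = ctx.transpose(0, 1).contiguous().view(seq, batch, embed)
    return out_proj(ctx)


# Hypothetical usage with embed=1024 and 16 heads:
# qkv = torch.nn.Linear(1024, 3 * 1024); out = torch.nn.Linear(1024, 1024)
# y = self_attention_ref(qkv, out, torch.randn(64, 8, 1024), heads=16)
```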
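The 64K limit in the commit title and in the Z-dimension item above refers to the number of matrices in a single batched GEMM launch: such kernels typically index the batch with the CUDA grid's z dimension, which is capped at 65535, so larger batches must be split across multiple launches. Below is a minimal Python sketch of that splitting idea using torch.bmm; MAX_BATCH and chunked_bmm are hypothetical names, and the actual change presumably lives in the C++/CUDA extension's GEMM calls rather than in Python.

```python
import torch

MAX_BATCH = 65535  # assumed per-launch batch (Z-dimension) limit


def chunked_bmm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Batched matmul that splits batches larger than MAX_BATCH into chunks.

    a: (batches, m, k), b: (batches, k, n) -> out: (batches, m, n)
    """
    batches = a.size(0)
    if batches <= MAX_BATCH:
        return torch.bmm(a, b)
    out = torch.empty(batches, a.size(1), b.size(2), dtype=a.dtype, device=a.device)
    for start in range(0, batches, MAX_BATCH):
        end = min(start + MAX_BATCH, batches)
        # Each slice stays under the per-launch batch limit.
        torch.bmm(a[start:end], b[start:end], out=out[start:end])
    return out
```

In the attention sketch above, the two torch.bmm calls run with a batch count of batch * heads, which is where a 64K per-launch limit would bite for large batch sizes.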