Fused Attention: Support 64-bit Ragged Offsets for Large THD Tensors (#1230)
* Use 64-bit ragged offsets when running with cuDNN 9.5+.
* Align workspace tensors to 16 bytes.
* Fix a bug where `std::accumulate` overflowed its accumulator on large tensor shapes.
* 64-bit offsets are only supported by the arbitrary-sequence-length FP16 backend.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>