Our implementation uses Apex's [FMHA](https://github.com/NVIDIA/apex/tree/master/apex/contrib/csrc/fmha) code as a starting point. We thank [Young-jun Ko](https://yjk21.github.io/) for the in-depth explanation of his FMHA implementation and for his thoughtful answers to our questions about CUDA.