Use TMA to optimize internode dispatch. (#276)
* Add TMA buffer allocation
* Use TMA for forwarders and NVL receivers
* Use lane 31 to operate TMA.
* Change RDMA buffer layout.
* Use TMA to transfer scales also.
* Increase the NVL recv buffer size.
* Disable early stopping.
* Apply similar optimizations on receiver warps.
* Prevent warp divergence.
* Disable aggressive PTX by default.
* Revert using TMA to transfer scales.
* Format.
* Change the layout of dispatch NVL buffer.
* Move topk transformation to recv warps.
* Use TMA to transfer all data in forward warps
* Use TMA to store scales.
* Code lint
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>