Use TMA to optimize internode combine. (#287)
* Let forwarders use a dedicated SM
* Shuffle rdma idx
* Senders use TMA
* Adjust the tuning chunk size
* Modify NVL chunk layout
* Update some combine config
* Small lint
* Minor fix
* Overlap TMA store
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>