- 21 Jul, 2025 2 commits
-
-
Zhiyi Hu authored
Co-authored-by: zhiyi Hu <zhiyihu@U-NYQQMGK0-2250.local>
-
Shangyan Zhou authored
* Remove an outdated todo
* Increase the number of combine forward warps.
* forwarder use TMA.
* Small fix
* Code lint
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 17 Jul, 2025 1 commit
-
-
Guangguan authored
During dispatch/combine, neither the sender nor the receiver waits for the RDMA channel head update to finish, so head-update WQEs may still be in flight after the kernel has finished. If these inflight WQEs arrive after the RDMA channel head buffer has been cleaned for the next round of dispatch/combine, the head buffer is rewritten to a non-zero value. With a wrong RDMA channel head, the RDMA sender can reuse the data buffer before the RDMA receivers have consumed it, causing data errors and kernel hangs. For performance reasons, to overlap the inflight WQEs' RTT, this issue is fixed by waiting for all previous inflight WQEs to complete before cleaning the RDMA buffers in the next round of dispatch/combine.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
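The race described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `HeadBufferModel` and `next_round_head` are hypothetical names, the queue stands in for inflight head-update WQEs, and nothing here reflects the actual kernel code.

```python
from collections import deque


class HeadBufferModel:
    """Toy model of one RDMA channel head slot plus its inflight updates (hypothetical)."""

    def __init__(self):
        self.head = 0            # value the sender reads to decide on buffer reuse
        self.inflight = deque()  # head updates posted but not yet landed

    def post_head_update(self, value):
        # The update is queued; it has not reached the head buffer yet.
        self.inflight.append(value)

    def drain(self):
        # Wait until every previously posted update has landed.
        while self.inflight:
            self.head = self.inflight.popleft()

    def clean(self):
        # Buffer cleaning at the start of the next dispatch/combine round.
        self.head = 0


def next_round_head(wait_before_clean):
    """Return the head value seen by the next round after cleaning."""
    model = HeadBufferModel()
    model.post_head_update(7)    # last head update of the previous round, still in flight
    if wait_before_clean:
        model.drain()            # the fix: complete inflight updates first
    model.clean()
    model.drain()                # any straggler lands after the clean
    return model.head
```

Without the wait, the straggler overwrites the cleaned slot and the next round starts from a stale non-zero head; with the wait, the slot stays zero after cleaning.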
-
- 16 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix rdma head movement
* Optimize `cached_notify` by using TMA.
* Fix
* Small fix
-
- 14 Jul, 2025 2 commits
-
-
Shangyan Zhou authored
-
Shangyan Zhou authored
* Strengthen the barrier in `cached_notify`
* lint
* Change the timing method
* lint
-
- 10 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Let forwarders use a dedicated SM
* Shuffle rdma idx
* Sender use TMA.
* Adjust the tuning chunk size.
* Modify NVL chunk layout.
* Update some combine config.
* Small lint
* Minor fix
* Overlap TMA store
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 04 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Add TMA buffer allocation
* Use TMA for forwarders and NVL receivers
* Use lane 31 to operate TMA.
* Change rdma buffer layout.
* Use TMA to transfer scales also.
* Increase the NVL recv buffer size.
* Disable early stopping.
* Apply similar optimizations on receiver warps.
* Prevent warp divergence.
* Disable aggressive ptx by default.
* Revert using TMA to transfer scales.
* Format.
* Change the layout of dispatch NVL buffer.
* Move topk transformation to recv warps.
* Use TMA to transfer all data in forward warps
* Use TMA to store scales.
* Code lint
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 02 Jul, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 27 Jun, 2025 5 commits
-
-
alpha-baby authored
-
alpha-baby authored
-
Shangyan Zhou authored
* Remove memory fence in NVLink barrier.
* Move `__syncthreads` and fence into barrier.
* Fix bugs
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
Shangyan Zhou authored
-
Chenggang Zhao authored
-
- 26 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 25 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Support bias.
* Fix.
* Fix style.
-
- 24 Jun, 2025 2 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add draft
* Add fast-debugging flags
* Fix several bugs
* Add sender timeout checks
* Fix stuck
* Fix bugs
* Fix bugs
-
- 18 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix the tail loading issue.
* Modify the sync offset.
-
- 16 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix warp synchronization.
* Another fix.
-
- 13 Jun, 2025 2 commits
-
-
Shangyan Zhou authored
-
Zhicheng Wu authored
Let the sender SM use `channel_id` and the receiver SM use `channel_id + num_channels`.
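The index split in this commit message can be sketched as follows. The function name `channel_slot` is hypothetical, and the message does not say which resource the index selects, so the sketch only shows that sender and receiver indices never collide.

```python
def channel_slot(channel_id, num_channels, is_sender):
    """Return the index used by a sender or receiver SM for a channel (hypothetical helper)."""
    if not 0 <= channel_id < num_channels:
        raise ValueError("channel_id out of range")
    # Senders occupy [0, num_channels); receivers occupy [num_channels, 2 * num_channels).
    return channel_id if is_sender else channel_id + num_channels


num_channels = 10
sender_slots = {channel_slot(c, num_channels, True) for c in range(num_channels)}
receiver_slots = {channel_slot(c, num_channels, False) for c in range(num_channels)}
```

With this layout the two roles use disjoint index ranges, so a sender and a receiver working on the same channel do not share an index.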
-
- 12 Jun, 2025 1 commit
-
-
Shifang Xu authored
-
- 11 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
-
- 10 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor fix styles
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
-
- 03 Jun, 2025 1 commit
-
-
wzc.wuzhicheng authored
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
-
- 28 May, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 10 May, 2025 1 commit
-
-
wangfakang authored
To mitigate incast congestion, shuffle the starting index of the target rank for different ranks and channels.
Signed-off-by: wangfakang <fakangwang@gmail.com>
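A minimal sketch of the shuffling idea, assuming the starting offset is simply `(rank + channel_id) % num_ranks`. The exact formula in the kernel is not given in the commit message, and `target_order` is a hypothetical name.

```python
def target_order(rank, channel_id, num_ranks):
    """Order in which one (rank, channel) pair visits target ranks (assumed formula)."""
    start = (rank + channel_id) % num_ranks
    return [(start + i) % num_ranks for i in range(num_ranks)]


def targets_at_step(step, channel_id, num_ranks):
    """Targets chosen by all ranks at the same step of the loop."""
    return [target_order(r, channel_id, num_ranks)[step] for r in range(num_ranks)]
```

Without the offset, every rank would start with target 0 and then move to target 1 together, concentrating traffic on one receiver at a time. With the offset, the ranks hit distinct targets at each step, which spreads the load.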
-
- 22 Apr, 2025 3 commits
-
-
Shangyan Zhou authored
-
Chenggang Zhao authored
-
Shangyan Zhou authored
-
- 21 Apr, 2025 2 commits
-
-
Shangyan Zhou authored
-
moningchen authored
In the Internode Normal Kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries the data between two GPUs, which limits kernel performance in dual-port NIC and RoCE network scenarios. In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we changed the transmission method from IBRC to IBGDA. With these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60GB/s+.
-
- 28 Mar, 2025 2 commits
-
-
Chenggang Zhao authored
-
songhexiang authored
For the SMs that calculate metadata in `notify_dispatch`, each warp in the SM calculates the metadata of one channel. The default configuration is 8 warps for 10 channels, which requires two loop iterations. The number of warps could perhaps be set to the number of channels so that one iteration is enough.
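The arithmetic behind this observation can be checked with a short sketch. It assumes a strided assignment of channels to warps (warp `w` handles channels `w`, `w + num_warps`, and so on); that assignment and both function names are illustrative assumptions, not taken from the kernel.

```python
def loop_rounds(num_channels, num_warps):
    """Number of loop iterations needed so every channel gets processed."""
    return -(-num_channels // num_warps)  # ceiling division


def channels_for_warp(warp_id, num_channels, num_warps):
    """Channels handled by one warp under the assumed strided assignment."""
    return list(range(warp_id, num_channels, num_warps))
```

With 8 warps and 10 channels, warps 0 and 1 each handle two channels while the other six sit idle during the second iteration; with 10 warps, every warp handles exactly one channel.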
-
- 06 Mar, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 05 Mar, 2025 2 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
- 25 Feb, 2025 1 commit
-
-
Chenggang Zhao authored
-