- 02 Jul, 2025 2 commits
-
-
Chenggang Zhao authored
-
ruizhang1230 authored
* support hidden size 8192
* refactor code
* fix assert
-
- 27 Jun, 2025 6 commits
-
-
alpha-baby authored
-
alpha-baby authored
-
Chenggang Zhao authored
-
Shangyan Zhou authored
* Remove memory fence in NVLink barrier.
* Move `__syncthread` and fence into barrier.
* Fix bugs

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
Shangyan Zhou authored
-
Chenggang Zhao authored
-
- 26 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 25 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Support bias.
* Fix.
* Fix style.
-
- 24 Jun, 2025 3 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add draft
* Add fast-debugging flags
* Fix several bugs
* Add sender timeout checks
* Fix stuck
* Fix bugs
* Fix bugs
-
Shangyan Zhou authored
* Increase the test round.
* Add warp synchronization.
* Shuffle the send warps.
* Add time elapsed into bench result.
-
- 23 Jun, 2025 1 commit
-
-
fzyzcjy authored
-
- 20 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 18 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
* Fix the tail loading issue.
* Modify the sync offset.
-
- 16 Jun, 2025 4 commits
-
-
Shangyan Zhou authored
* Fix warp synchronization.
* Another fix.
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add automatic warp count control for low-latency dispatch
* Add automatic warp count control for low-latency combine
* More assertions
-
fzyzcjy authored
-
- 13 Jun, 2025 2 commits
-
-
Shangyan Zhou authored
-
Zhicheng Wu authored
Let the sender SM use `channel_id` and the receiver SM use `channel_id + num_channels`.
-
- 12 Jun, 2025 1 commit
-
-
Shifang Xu authored
-
- 11 Jun, 2025 2 commits
-
-
Chenggang Zhao authored
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
-
Chenggang Zhao authored
-
- 10 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
* Fully remove FIFO slots
* Fully remove FIFO buffers
* Minor fix styles
* Fix some typos
* Bugs fixed
* Cleanup `ibgda_poll_cq`
-
- 09 Jun, 2025 3 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add low-latency kernel usage flag
* Update comments
-
Chenggang Zhao authored
-
- 06 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
-
- 03 Jun, 2025 1 commit
-
-
wzc.wuzhicheng authored
Signed-off-by: wzc.wuzhicheng <wzc.wuzhicheng@linux.alibaba.com>
-
- 28 May, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 23 May, 2025 2 commits
-
-
Chenggang Zhao authored
-
cywork121 authored
-
- 12 May, 2025 1 commit
-
-
sleepcoo authored
Co-authored-by: zhyncs <me@zhyncs.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
-
- 10 May, 2025 1 commit
-
-
wangfakang authored
To mitigate incast congestion, shuffle the starting index of target rank for different ranks and channels.
Signed-off-by: wangfakang <fakangwang@gmail.com>
-
- 22 Apr, 2025 4 commits
-
-
Shangyan Zhou authored
-
Chenggang Zhao authored
-
Shangyan Zhou authored
-
Shangyan Zhou authored
-