- 15 Sep, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 10 Sep, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 25 Aug, 2025 1 commit
-
-
sky authored
Signed-off-by: wangfakang <fakangwang@gmail.com>
-
- 19 Aug, 2025 1 commit
-
-
Tailing Yuan authored
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
-
- 30 Jul, 2025 1 commit
-
-
sky authored
* Add diagnosis module for precise identification of slow ranks
* Add tests for the slow diagnosis module
* Update some comments for diagnose
* Update test case for diagnose
* Strip the diagnose module, thx LyricZhao and sphish.
* update variable name and cumulative wait recv cost, thx sphish.
* remove invalid comments.
Signed-off-by: wangfakang <fakangwang@gmail.com>
-
- 22 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 21 Jul, 2025 1 commit
-
-
Guangguan Wang authored
* Add arg --pressure-test for test_low_latency.py
* Export NVSHMEM_QP_DEPTH
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
-
- 15 Jul, 2025 1 commit
-
-
Seth Howell authored
This enables the CPU-Assisted data path.
Signed-off-by: Seth Howell <sethh@nvidia.com>
-
- 11 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Explicitly destroy the C++ runtime and release resources.
* Small fix
* fix typo
* Add a flag to control whether explicit `destroy()` is required.
-
- 10 Jul, 2025 2 commits
-
-
Shangyan Zhou authored
* Let forwarders use a dedicated SM
* Shuffle rdma idx
* Sender use TMA.
* Adjust the tuning chunk size.
* Modify NVL chunk layout.
* Update some combine config.
* Small lint
* Minor fix
* Overlap TMA store
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
Chenggang Zhao authored
* Add LogFMT interface
* Update comments
* Add simulated code
* Fix comments
* Change to 128 channels
* Add notes
* Optimize performance
* optimize simulate logfmt 10bit
* Minor fix
* Stronger low latency tests
* Minor fix
* Stronger low latency tests for logfmt
* Optimize logfmt simulate: lg2/ex2 ptx, step_inv
* Minor fix
* Minor fix
* Add non-logfmt bench
* Fix value=0 for logfmt
* Optimize performance
* Refactor tests
Co-authored-by: Zhean Xu <xza@deepseek.com>
-
- 04 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 27 Jun, 2025 1 commit
-
-
Shangyan Zhou authored
-
- 25 Jun, 2025 2 commits
-
-
Shangyan Zhou authored
* Support bias.
* Fix.
* Fix style.
-
Shangyan Zhou authored
* Add `get_comm_stream`.
* Fix style.
-
- 16 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 12 Jun, 2025 1 commit
-
-
Shifang Xu authored
-
- 11 Jun, 2025 4 commits
-
-
Chenggang Zhao authored
* Use `pynvml` to detect NVLink connections
* Add a TODO
* Add shutdown
* Fix comments
-
Chenggang Zhao authored
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
- 09 Jun, 2025 2 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
* Add low-latency kernel usage flag
* Update comments
-
- 07 Jun, 2025 1 commit
-
-
fzyzcjy authored
-
- 06 Jun, 2025 2 commits
-
-
Chenggang Zhao authored
* Update CMake files
* Use TMA instead of LD/ST for intranode dispatch
* Use TMA instead of LD/ST for intranode combine
* Adjust configs
* Test default configs as well
* More warps for combine
* Add inter-thread fence
* Enable more warps
* Do not use TMA for senders
* Update configs
* Remove useless wait
-
Shangyan Zhou authored
Co-authored-by: Shangyan Zhou <sy.zhou@deepseek.com>
-
- 23 May, 2025 3 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
cywork121 authored
-
- 22 Apr, 2025 2 commits
-
-
Chenggang Zhao authored
-
Shangyan Zhou authored
-
- 21 Apr, 2025 1 commit
-
-
moningchen authored
In the internode normal kernel, RDMA data transmission used NVSHMEM IBRC with a single QP for all traffic between two GPUs, which limits kernel performance in dual-port NIC and RoCE network scenarios. The optimized internode normal kernel uses multiple QPs for data transmission between two GPUs, assigning a different QP to each channel, and changes the transport from IBRC to IBGDA. With these optimizations, the internode normal kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Under the current default statistical method, RDMA performance exceeds 60 GB/s in 4-node H800 and H20 environments.
-
- 03 Apr, 2025 1 commit
-
-
fzyzcjy authored
-
- 25 Mar, 2025 1 commit
-
-
fzyzcjy authored
-
- 18 Mar, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 13 Mar, 2025 1 commit
-
-
Dmytro Dzhulgakov authored
-
- 10 Mar, 2025 2 commits
-
-
Dmytro Dzhulgakov authored
-
Chenggang Zhao authored
-
- 05 Mar, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 04 Mar, 2025 1 commit
-
-
Chenggang Zhao authored
-