- 23 May, 2026 1 commit
-
-
lijian authored
Signed-off-by: lijian <34831075+lijian0711@users.noreply.github.com>
-
- 13 May, 2026 1 commit
-
-
lijian authored
Signed-off-by: lijian <34831075+lijian0711@users.noreply.github.com>
-
- 23 Mar, 2026 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 23 Jan, 2026 1 commit
-
-
lishen authored
-
- 15 Jan, 2026 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 11 Dec, 2025 1 commit
-
-
lijian6 authored
At the end of the hash, 'a' means rocshmem and 'b' means nvshmem. Signed-off-by: lijian <lijian6@sugon.com>
-
- 24 Oct, 2025 1 commit
-
-
lijian6 authored
2. Modify the test script to reduce GPU memory usage, from 17G to 8G. Signed-off-by: lijian <lijian6@sugon.com>
-
- 17 Oct, 2025 1 commit
-
-
lijian6 authored
Signed-off-by: lijian <lijian6@sugon.com>
-
- 24 Sep, 2025 1 commit
-
-
Tailing Yuan authored
Co-authored-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
-
- 10 Sep, 2025 2 commits
-
-
Chenggang Zhao authored
-
Chenggang Zhao authored
-
- 05 Aug, 2025 1 commit
-
-
windreamer authored
-
- 31 Jul, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 15 Jul, 2025 3 commits
-
-
Seth Howell authored
Responding to review comments. Signed-off-by: Seth Howell <sethh@nvidia.com>
-
Seth Howell authored
Signed-off-by: Seth Howell <sethh@nvidia.com>
-
Seth Howell authored
Signed-off-by: Seth Howell <sethh@nvidia.com>
-
- 12 Jul, 2025 1 commit
-
-
Seth Howell authored
NVSHMEM 3.3 and above support the host-side features in the patch. Note: Removed recv queue support. Signed-off-by: Seth Howell <sethh@nvidia.com>
-
- 04 Jul, 2025 1 commit
-
-
Shangyan Zhou authored
* Add TMA buffer allocation
* Use TMA for forwarders and NVL receivers
* Use lane 31 to operate TMA.
* Change rdma buffer layout.
* Use TMA to transfer scales also.
* Increase the NVL recv buffer size.
* Disable early stopping.
* Apply similar optimizations on receiver warps.
* Prevent warp divergence.
* Disable aggressive ptx by default.
* Revert using TMA to transfer scales.
* Format.
* Change the layout of dispatch NVL buffer.
* Move topk transformation to recv warps.
* Use TMA to transfer all data in forward warps
* Use TMA to store scales.
* Code lint
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
-
- 27 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
-
- 11 Jun, 2025 1 commit
-
-
Chenggang Zhao authored
* Update README
* Update `setup.py`
* Fix headers
* Add `DISABLE_NVSHMEM` for APIs
* Fix launch
* Fix TMA settings
* Fix TMA usages
* Fix dlink
* Separate layout kernels
* Update version
* Add `is_sm90_compiled`
* Fix tests
* Add NVLink connection checks
* Update README
* Fix tests
* Add some comments
* Minor fix
* Minor fix
* Fix bugs
-
- 19 May, 2025 1 commit
-
-
guyueh1 authored
* Add 10.0 to TORCH_CUDA_ARCH_LIST
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
* Revert csrc/CMakeLists change; in setup.py make TORCH_CUDA_ARCH_LIST configurable
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
---------
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
-
- 25 Feb, 2025 1 commit
-
-
Chenggang Zhao authored
-