Commits · 3e54b78fd776108d04e959e6988e96f62d0314d8 · OpenDAS / DeepEP

22 Apr, 2025 2 commits
- Normal kernels always use IBGDA mode. · 3e54b78f
  Shangyan Zhou authored Apr 22, 2025
  
  3e54b78f
- Refactor some code. · 20b2aaaf
  Shangyan Zhou authored Apr 22, 2025
  
  20b2aaaf
21 Apr, 2025 4 commits

- Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into... · c07fdd19
  moningchen authored Apr 21, 2025
```
Merge branch 'trmt/internode_multi_qp' of github.com:deepseek-ai/DeepEP into trmt/internode_multi_qp
```
  c07fdd19
- Add the performance data after internode optimization in the Readme file · e0eaaf94
  moningchen authored Apr 21, 2025
  
  e0eaaf94
- Revert `ibgda_device.cuh` and remove some comments. · e2c57848
  Shangyan Zhou authored Apr 21, 2025
  
  e2c57848
- In the Internode Normal Kernel, when using nvshmem ibrc for RDMA data... · 5ab80c28
  moningchen authored Apr 21, 2025

  In the Internode Normal Kernel, when NVSHMEM IBRC is used for RDMA data transmission, a single QP carries all traffic between two GPUs, which limits kernel performance with dual-port NICs and on RoCE networks.

  In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we changed the transmission method from IBRC to IBGDA.

  With these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance close to the physical network limit. Using the current default statistical method, RDMA performance can exceed 60 GB/s in 4-node H800 and H20 environments.

  5ab80c28
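
The one-QP-per-channel idea in the commit above can be pictured with a small sketch. This is only an illustration under stated assumptions: the helper names `qp_index_for_channel` and `channels_per_qp`, the modulo fallback for the case of fewer QPs than channels, and the channel count of 10 are all hypothetical and are not taken from the DeepEP or NVSHMEM code.

```python
def qp_index_for_channel(channel_id: int, num_qps: int) -> int:
    """Pick a QP index for a channel (hypothetical helper, not a DeepEP API).

    Assumption: with at least as many QPs as channels, every channel gets its
    own QP; with fewer QPs, channels wrap around and share.
    """
    if num_qps <= 0:
        raise ValueError("num_qps must be positive")
    if channel_id < 0:
        raise ValueError("channel_id must be non-negative")
    return channel_id % num_qps


def channels_per_qp(num_channels: int, num_qps: int) -> dict:
    """Group channel ids by the QP they would use between one GPU pair."""
    mapping = {qp: [] for qp in range(num_qps)}
    for channel_id in range(num_channels):
        mapping[qp_index_for_channel(channel_id, num_qps)].append(channel_id)
    return mapping


# Single-QP case: every channel funnels through QP 0.
single_qp = channels_per_qp(num_channels=10, num_qps=1)

# Multi-QP case: one QP per channel, so no two channels share a QP.
multi_qp = channels_per_qp(num_channels=10, num_qps=10)
```

In the single-QP mapping all ten channels land on one QP, while in the multi-QP mapping each QP serves exactly one channel, which is the arrangement the commit message describes.
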

16 Apr, 2025 2 commits
- Merge pull request #124 from wplf/patch-1 · a84a2480
  Shangyan Zhou authored Apr 16, 2025
```
Fix typo in nvshmem.patch
```
  a84a2480
- Fix typo in nvshmem.patch · a2ccc95d
  李金梁 authored Apr 16, 2025
  
  a2ccc95d
14 Apr, 2025 2 commits
- Merge pull request #118 from andylin-hao/main · a0c69317
  Chenggang Zhao authored Apr 14, 2025
```
Fix test combine args
```
  a0c69317
- Merge pull request #119 from phantom5125/patch-1 · b9bb2bba
  Shangyan Zhou authored Apr 14, 2025
```
Fix typo in nvshmem.patch
```
  b9bb2bba
12 Apr, 2025 1 commit
- Fix typo in nvshmem.patch · 42f61708
  GreatHato authored Apr 13, 2025
  
  42f61708
11 Apr, 2025 2 commits
- Fix test combine args · 23c54150
  Hao Lin authored Apr 11, 2025
```
Signed-off-by: Hao Lin <linhaomails@gmail.com>
```
  23c54150
- Merge pull request #116 from alpha-baby/fix-test-result-not-output · 8a0ca8e2
  Chenggang Zhao authored Apr 11, 2025
```
fix: not output result in some linux system
```
  8a0ca8e2
10 Apr, 2025 1 commit
- fix: not output result in some linux system · 0f80da84
  fujianhao.fjh authored Apr 10, 2025
  
  0f80da84
07 Apr, 2025 1 commit
- Remove useless control metadata for low-latency combine · 42494864
  Chenggang Zhao authored Apr 07, 2025
  
  42494864
03 Apr, 2025 2 commits
- Merge pull request #108 from fzyzcjy/patch-2 · 2a0b3d7a
  Chenggang Zhao authored Apr 03, 2025
```
Super tiny fix shape typo
```
  2a0b3d7a
- Update buffer.py · 218c5a1f
  fzyzcjy authored Apr 03, 2025
  
  218c5a1f
28 Mar, 2025 4 commits
- Fix zero-copy mode tests · 26fa72d8
  Chenggang Zhao authored Mar 28, 2025
  
  26fa72d8
- Fix compilation · c4d12b4f
  Chenggang Zhao authored Mar 28, 2025
  
  c4d12b4f
- Merge pull request #96 from songhexiang/adjust_kNumThreads_of_notify_dispatch · dcf46f1c
  Chenggang Zhao authored Mar 28, 2025
```
Adjust kNumThreads of notify_dispatch
```
  dcf46f1c
- For the SMs which calculate metadata in notify_dispatch, each warp in the SM... · 4dd1e68a
  songhexiang authored Mar 28, 2025
```
For the SMs that calculate metadata in notify_dispatch, each warp in the SM calculates the metadata of one channel. The default configuration is 8 warps for 10 channels, which requires two loop iterations. Maybe the number of warps can be set to the number of channels so that one iteration is enough.
```
  4dd1e68a
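
The arithmetic behind the commit message above (8 warps covering 10 channels takes two passes) can be sketched as follows. This is a minimal model, assuming each warp handles one channel per pass and strides by the warp count; the function names `loop_rounds` and `channels_by_round` are hypothetical and do not appear in DeepEP.

```python
def loop_rounds(num_channels: int, num_warps: int) -> int:
    """Number of passes needed when each warp handles one channel per pass."""
    if num_warps <= 0:
        raise ValueError("num_warps must be positive")
    if num_channels < 0:
        raise ValueError("num_channels must be non-negative")
    # Ceiling division without floating point.
    return (num_channels + num_warps - 1) // num_warps


def channels_by_round(num_channels: int, num_warps: int) -> list:
    """List the channel ids processed in each pass of a warp-strided loop."""
    return [
        list(range(start, min(start + num_warps, num_channels)))
        for start in range(0, num_channels, num_warps)
    ]


default_rounds = loop_rounds(num_channels=10, num_warps=8)    # 2 passes
matched_rounds = loop_rounds(num_channels=10, num_warps=10)   # 1 pass
default_schedule = channels_by_round(num_channels=10, num_warps=8)
```

With 8 warps, the second pass covers only channels 8 and 9, so six warps sit idle; matching the warp count to the channel count removes that extra pass.
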
27 Mar, 2025 3 commits
- Remove NVLink low-latency plan · e130cc6e
  Chenggang Zhao authored Mar 27, 2025
  
  e130cc6e
- Update README · cbd92fd0
  Chenggang Zhao authored Mar 27, 2025
  
  cbd92fd0
- Stronger acquire scope for low-latency kernels · ffc39ba0
  Chenggang Zhao authored Mar 27, 2025
  
  ffc39ba0
25 Mar, 2025 3 commits
- Merge pull request #89 from fzyzcjy/patch-1 · 7d52ad72
  Chenggang Zhao authored Mar 25, 2025
```
Super tiny fix typo
```
  7d52ad72
- Remove confusing comments · ae0eafd2
  Chenggang Zhao authored Mar 25, 2025
  
  ae0eafd2
- Update buffer.py · 36b5c279
  fzyzcjy authored Mar 25, 2025
  
  36b5c279
18 Mar, 2025 3 commits
- Merge pull request #79 from deepseek-ai/zero-copy-combine · c4b8ffc3
  Chenggang Zhao authored Mar 18, 2025
```
Support zero-copy for low-latency combine
```
  c4b8ffc3
- Support zero-copy for low-latency combine · 66465476
  Chenggang Zhao authored Mar 18, 2025
  
  66465476
- Support zero-copy for low-latency combine · dcaf73e5
  Chenggang Zhao authored Mar 18, 2025
  
  dcaf73e5
14 Mar, 2025 4 commits
- Fix bugs for intranode EP kernels · 82dcf48f
  Chenggang Zhao authored Mar 14, 2025
  
  82dcf48f
- Merge pull request #73 from deepseek-ai/p2p-signal · 043fa5fa
  Chenggang Zhao authored Mar 14, 2025
```
Low latency kernels use rdma atomic to support AR
```
  043fa5fa
- Fix style. · 38cdaf39
  Shangyan Zhou authored Mar 14, 2025
  
  38cdaf39
- Low latency kernels use rdma atomic to support AR. · 2d0cf41d
  Shangyan Zhou authored Mar 14, 2025
  
  2d0cf41d
13 Mar, 2025 2 commits
- Merge pull request #66 from dzhulgakov/combine-out-arg · 7128ba3e
  Chenggang Zhao authored Mar 13, 2025
```
Allow passing output tensor in low_latency_combine
```
  7128ba3e
- comments · 50ac280a
  Dmytro Dzhulgakov authored Mar 13, 2025
  
  50ac280a
11 Mar, 2025 1 commit
- Merge pull request #67 from deepseek-ai/roce-support · 0008c675
  Chenggang Zhao authored Mar 11, 2025
```
Update NVSHMEM to v3.2.5.
```
  0008c675
10 Mar, 2025 2 commits
- Allow passing output tensor in low_latency_combine · b3b61ef5
  Dmytro Dzhulgakov authored Mar 10, 2025
  
  b3b61ef5
- Support BF16 for low-latency kernels · ed7487c1
  Chenggang Zhao authored Mar 10, 2025
  
  ed7487c1
06 Mar, 2025 1 commit
- Improve AR performance · 1fc40d50
  Chenggang Zhao authored Mar 06, 2025
  
  1fc40d50