Commit 2a3cac90 authored by Chenggang Zhao

Add some docs

parent c5b40405
@@ -282,6 +282,10 @@ For two micro-batch overlapping, you can refer to the following figure. With our
## Notices
#### Easier potential overall design
The current DeepEP implementation uses queues for communication buffers, which saves memory but introduces complexity and potential deadlocks. If you are implementing your own version based on DeepEP, consider using fixed-size buffers allocated to maximum capacity, for simplicity and better performance. For a detailed discussion of this alternative approach, see https://github.com/deepseek-ai/DeepEP/issues/39.
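
The fixed-size alternative can be illustrated with a small sketch. This is not DeepEP code: `FixedSlotLayout` and its fields are hypothetical names, and the layout (one maximum-capacity region per source rank) is only one possible reading of "fixed-size buffers allocated to maximum capacity". The idea it shows is that every sender gets a precomputed, exclusive write offset, so no sender has to wait for queue space to be freed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedSlotLayout:
    """Hypothetical layout: one max-capacity region per source rank."""

    num_ranks: int            # number of ranks that may send to this buffer
    max_tokens_per_rank: int  # assumed worst-case token count from any one rank
    token_bytes: int          # assumed payload size of one token, in bytes

    @property
    def total_bytes(self) -> int:
        # Memory cost of sizing every region for the worst case.
        return self.num_ranks * self.max_tokens_per_rank * self.token_bytes

    def offset(self, src_rank: int, token_idx: int) -> int:
        # Byte offset at which `src_rank` writes its `token_idx`-th token.
        if not 0 <= src_rank < self.num_ranks:
            raise IndexError("src_rank out of range")
        if not 0 <= token_idx < self.max_tokens_per_rank:
            raise IndexError("token_idx exceeds the preallocated capacity")
        return (src_rank * self.max_tokens_per_rank + token_idx) * self.token_bytes


layout = FixedSlotLayout(num_ranks=4, max_tokens_per_rank=16, token_bytes=32)
print(layout.total_bytes)   # total buffer size in bytes
print(layout.offset(1, 0))  # start of rank 1's region
```

The trade-off is visible in `total_bytes`: memory grows with the worst case for every rank rather than with actual traffic, which is the cost the queue-based design avoids.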
#### Undefined-behavior PTX usage
- For extreme performance, we discovered and use a PTX usage with undefined behavior: the read-only PTX instruction `ld.global.nc.L1::no_allocate.L2::256B` to **read volatile data**. The `.nc` modifier indicates that a non-coherent cache is used. However, correctness has been tested to hold with `.L1::no_allocate` on Hopper architectures, and performance is much better. Our guess at the reason: the non-coherent cache is unified with L1, and the L1 modifier is not just a hint but a strong option, so correctness is guaranteed because there is no dirty data in L1.
......
@@ -44,7 +44,6 @@ class Buffer:
to the number of local experts.
"""
# TODO: argument docs
# Initialize the CPP runtime
self.rank = group.rank()
self.group_size = group.size()
......