Commit 7ce8da4e authored by Chenggang Zhao

Minor fixes

parent ed3444bf
@@ -82,7 +82,7 @@ NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install
 - `NVSHMEM_DIR`: the path to the NVSHMEM directory, disable all internode and low-latency features if not specified
 - `DISABLE_SM90_FEATURES`: 0 or 1, whether to disable SM90 features, it is required for SM90 devices or CUDA 11
 - `TORCH_CUDA_ARCH_LIST`: the list of target architectures, e.g. `TORCH_CUDA_ARCH_LIST="9.0"`
-- `DISABLE_AGGRESSIVE_PTX_INSTRS`: 0 or 1, whether to disable aggressive load/store instructions, see [Undefine behavior PTX usage](#undefined-behavior-ptx-usage) for more details
+- `DISABLE_AGGRESSIVE_PTX_INSTRS`: 0 or 1, whether to disable aggressive load/store instructions, see [Undefined-behavior PTX usage](#undefined-behavior-ptx-usage) for more details
 Then, import `deep_ep` in your Python project, and enjoy!
......
@@ -563,16 +563,16 @@ dispatch(int4* recv_x, float* recv_x_scales, int64_t* recv_topk_idx, float* recv
 auto window = rdma_send_channel_window[lane_id];
 auto latest_tail = rdma_send_channel_tail[lane_id];
 auto offset = rdma_tail_idx - latest_tail;
 while (offset >= 32) {
     release_lock(rdma_send_channel_lock + lane_id);
     acquire_lock(rdma_send_channel_lock + lane_id);
     latest_tail = rdma_send_channel_tail[lane_id];
     offset = rdma_tail_idx - latest_tail;
 }
 // Release the transaction slot
-// Erase bit and move the ones if possible
-window ^= 1u << offset;
+// Add the bit and move the ones if possible
+window |= 1u << offset;
 if (offset == 0) {
     auto num_empty_slots = (~window) == 0 ? 32 : __ffs(~window) - 1;
     st_release_cta(rdma_send_channel_tail + lane_id, latest_tail + num_empty_slots);
......
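The window update in the hunk above can be sketched on the host as follows. This is a minimal illustration, not the kernel code: `SlotWindow`, `release_slot`, and `ffs32` are hypothetical names, `ffs32` stands in for CUDA's `__ffs`, and the final shift of the window is an assumption about what follows the lines shown in the diff. Locking and the retry loop for `offset >= 32` are left out.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for CUDA's __ffs: 1-based position of the lowest set bit, 0 if none.
static int ffs32(uint32_t v) {
    if (v == 0) return 0;
    int pos = 1;
    while ((v & 1u) == 0) { v >>= 1; ++pos; }
    return pos;
}

// Hypothetical per-channel state: bit i of `window` marks slot (tail + i) as released.
struct SlotWindow {
    uint32_t window = 0;
    int tail = 0;
};

// Marks slot `idx` as released and advances the tail over the leading run of released slots.
// Returns false when the slot lies outside the 32-slot window (the kernel retries instead).
bool release_slot(SlotWindow& w, int idx) {
    assert(idx >= w.tail);  // slots behind the tail are already released
    int offset = idx - w.tail;
    if (offset >= 32) return false;

    // OR sets the bit whatever its previous state; XOR would clear it on a repeated release.
    w.window |= 1u << offset;

    if (offset == 0) {
        // Count the consecutive released slots starting at the tail.
        int num_released = (~w.window) == 0 ? 32 : ffs32(~w.window) - 1;
        w.tail += num_released;
        // Assumed follow-up step: drop the consumed bits (a shift by 32 is undefined, so reset).
        w.window = num_released == 32 ? 0u : w.window >> num_released;
    }
    return true;
}
```

Releasing slots 1 and 2 first leaves the tail at 0 with two bits pending; releasing slot 0 afterwards moves the tail forward by three slots in one step.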