Unverified commit 8b83b937 authored by Tyler Michael Smith, committed by GitHub

[Docs] Document the extra memory footprint overhead when using EPLB (#24537)


Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
parent 9dbefd88
@@ -156,6 +156,13 @@ vllm serve Qwen/Qwen3-30B-A3B \
- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
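The two formulas above can be sketched in Python as follows. This is an illustrative helper, not part of vLLM: the function name `experts_per_rank` and the example values (128 experts, 8 EP ranks, 8 redundant experts) are assumptions, as is the choice to reject counts that do not divide evenly across ranks.

```python
def experts_per_rank(num_total_experts: int,
                     num_ep_ranks: int,
                     num_redundant_experts: int = 0) -> int:
    """Experts hosted on each EP rank, per the formulas above."""
    physical_experts = num_total_experts + num_redundant_experts
    # Assumption for this sketch: experts are spread evenly across ranks.
    if physical_experts % num_ep_ranks != 0:
        raise ValueError("expert count must divide evenly across EP ranks")
    return physical_experts // num_ep_ranks


# Hypothetical values chosen only for illustration.
print(experts_per_rank(128, 8))     # default: 16
print(experts_per_rank(128, 8, 8))  # with redundancy: 17
```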
### Memory Footprint Overhead
EPLB keeps redundant copies of experts, and these must also fit in GPU memory. As a result, EPLB may not be a good fit for memory-constrained environments or when KV cache space is at a premium.
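With EPLB, the expert weights on each rank occupy `NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS`; the extra overhead attributable to redundancy is the `NUM_REDUNDANT_EXPERTS` share of that, `NUM_MOE_LAYERS * BYTES_PER_EXPERT * NUM_REDUNDANT_EXPERTS ÷ NUM_EP_RANKS`.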
For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per rank.
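As a rough check of the DeepSeekV3 figure, the sketch below counts only the memory taken by the redundant experts on one rank. The model dimensions are assumptions not stated in this document (58 MoE layers, hidden size 7168, expert intermediate size 2048, three projection matrices per expert, FP8 weights at 1 byte per parameter), and `eplb_overhead_bytes` is a hypothetical helper, not a vLLM API.

```python
def eplb_overhead_bytes(num_moe_layers: int,
                        bytes_per_expert: int,
                        num_redundant_experts: int,
                        num_ep_ranks: int) -> int:
    """Per-rank memory taken by redundant experts (sketch only)."""
    return (num_moe_layers * bytes_per_expert
            * num_redundant_experts // num_ep_ranks)


# Assumed DeepSeekV3 shape: gate, up, and down projections per expert.
HIDDEN_SIZE = 7168
MOE_INTERMEDIATE_SIZE = 2048
BYTES_PER_PARAM = 1  # FP8 weights (assumption)
bytes_per_expert = 3 * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE * BYTES_PER_PARAM

# One redundant expert per rank: 8 redundant experts across 8 EP ranks.
overhead = eplb_overhead_bytes(58, bytes_per_expert, 8, 8)
print(f"{overhead / 2**30:.2f} GiB per rank")
```

Under these assumptions the result lands near the 2.4 GB quoted above; a different weight dtype or more redundant experts per rank scales it linearly.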
### Example Command
Single node deployment with EPLB enabled:
......