[Docs] Document the extra memory footprint overhead when using EPLB (#24537)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

Unverified Commit 8b83b937 authored Sep 10, 2025 by Tyler Michael Smith Committed by GitHub Sep 10, 2025

Showing with 7 additions and 0 deletions

docs/serving/expert_parallel_deployment.md +7 -0

--- a/docs/serving/expert_parallel_deployment.md
+++ b/docs/serving/expert_parallel_deployment.md
@@ -156,6 +156,13 @@ vllm serve Qwen/Qwen3-30B-A3B \
 - **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
 - **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
+### Memory Footprint Overhead
+EPLB uses redundant experts that need to fit in GPU memory. This means that EPLB may not be a good fit for memory-constrained environments or when KV cache space is at a premium.
+This overhead equals `NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS`.
+For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per rank.
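As a rough check on the figures above, the following sketch evaluates the per-rank expert memory and isolates the portion due to redundant experts. The DeepSeek-V3 dimensions used here (58 MoE layers, 256 routed experts, hidden size 7168, MoE intermediate size 2048, three projection matrices per expert, FP8 weights at one byte per parameter) and the 32-rank deployment are assumptions for illustration, not values taken from this commit.

```python
# Minimal sketch of the memory formula; all model numbers are assumptions.

def expert_bytes_per_rank(num_moe_layers, bytes_per_expert, num_experts, num_ep_ranks):
    """Bytes of expert weights held on one EP rank for `num_experts` experts in total."""
    return num_moe_layers * bytes_per_expert * num_experts / num_ep_ranks

# Assumed DeepSeek-V3 shape: gate, up, and down projections per expert, FP8 weights.
HIDDEN_SIZE = 7168
MOE_INTERMEDIATE_SIZE = 2048
BYTES_PER_PARAM = 1
BYTES_PER_EXPERT = 3 * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE * BYTES_PER_PARAM

NUM_MOE_LAYERS = 58
NUM_TOTAL_EXPERTS = 256
NUM_EP_RANKS = 32                    # hypothetical deployment size
NUM_REDUNDANT_EXPERTS = NUM_EP_RANKS  # one redundant expert per rank

# Total expert weights per rank, as in the documented expression.
total_bytes = expert_bytes_per_rank(
    NUM_MOE_LAYERS, BYTES_PER_EXPERT,
    NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS, NUM_EP_RANKS,
)

# Portion attributable to the redundant experts alone.
redundant_bytes = expert_bytes_per_rank(
    NUM_MOE_LAYERS, BYTES_PER_EXPERT, NUM_REDUNDANT_EXPERTS, NUM_EP_RANKS,
)

redundant_gib = redundant_bytes / 2**30
print(f"total per rank:     {total_bytes / 2**30:.1f} GiB")
print(f"redundant per rank: {redundant_gib:.1f} GiB")
```

Under these assumptions the redundant-expert portion comes to about 2.4 GiB per rank, in line with the quoted figure. The full expression with `NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS` gives the total expert weight memory per rank, of which the redundant experts are the extra cost introduced by EPLB.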
 ### Example Command
 Single node deployment with EPLB enabled: