[Docs] Update EPLB docs (#30426)

Signed-off-by: mgoin <mgoin64@gmail.com>

fcb89422 · Michael Goin · GitHub · 6ccb7bae · fcb89422
Unverified Commit fcb89422 authored Dec 10, 2025 by Michael Goin Committed by GitHub Dec 10, 2025

Showing with 5 additions and 4 deletions

docs/serving/expert_parallel_deployment.md +5 -4

--- a/docs/serving/expert_parallel_deployment.md
+++ b/docs/serving/expert_parallel_deployment.md
@@ -40,10 +40,12 @@ EP_SIZE = TP_SIZE × DP_SIZE
 Where:
- `TP_SIZE`: Tensor parallel size (always 1 for now)
+- `TP_SIZE`: Tensor parallel size
 - `DP_SIZE`: Data parallel size
 - `EP_SIZE`: Expert parallel size (computed automatically)
+When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
 ### Example Command
 The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
@@ -119,9 +121,6 @@ While MoE models are typically trained so that each expert receives a similar nu
 Enable EPLB with the `--enable-eplb` flag.
-!!! note "Model Support"
-    Currently only DeepSeek V3 architecture is supported.
 When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
 ### EPLB Parameters
@@ -134,6 +133,8 @@ Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. T
 | `step_interval`| Frequency of rebalancing (every N engine steps) | 3000 |
 | `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
 | `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
+| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
+| `policy` | The policy type for expert parallel load balancing | `"default"` |
 For example:
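
As an illustration of the options touched by this diff, the following Python sketch computes `EP_SIZE` from the formula in the first hunk and assembles a JSON string for `--eplb-config` from the parameters listed in the table. The helper names `ep_size` and `build_eplb_args` and the example values are assumptions made for illustration; they are not part of the commit or of vLLM itself. Only the flags and keys that appear in the diff are used, and the defaults are copied from the table.

```python
import json
import shlex


def ep_size(tp_size: int, dp_size: int) -> int:
    """Expert parallel size, per the documented formula EP_SIZE = TP_SIZE x DP_SIZE."""
    return tp_size * dp_size


def build_eplb_args(**overrides) -> list[str]:
    """Hypothetical helper: return CLI arguments that enable EPLB.

    The keys and defaults mirror the rows visible in the parameter table
    of the diff; other keys may exist but are not shown there.
    """
    config = {
        "step_interval": 3000,
        "log_balancedness": False,
        "num_redundant_experts": 0,
        "use_async": False,
        "policy": "default",
    }
    unknown = set(overrides) - set(config)
    if unknown:
        raise ValueError(f"unknown EPLB options: {sorted(unknown)}")
    config.update(overrides)
    # --eplb-config accepts a JSON string, so serialize the dict.
    return ["--enable-eplb", "--eplb-config", json.dumps(config)]


if __name__ == "__main__":
    # 1-way tensor parallel with 8-way data parallel, as in the example command.
    print("EP_SIZE =", ep_size(tp_size=1, dp_size=8))
    # Quote the arguments so the JSON survives a shell.
    print(shlex.join(build_eplb_args(use_async=True, num_redundant_experts=2)))
```

Rejecting unknown keys in the helper is a design choice of this sketch: it surfaces typos in option names before they reach the server command line.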