Commit fcb89422 authored by Michael Goin, committed by GitHub

[Docs] Update EPLB docs (#30426)


Signed-off-by: mgoin <mgoin64@gmail.com>
parent 6ccb7bae
@@ -40,10 +40,12 @@ EP_SIZE = TP_SIZE × DP_SIZE
 Where:
-- `TP_SIZE`: Tensor parallel size (always 1 for now)
+- `TP_SIZE`: Tensor parallel size
 - `DP_SIZE`: Data parallel size
 - `EP_SIZE`: Expert parallel size (computed automatically)
+When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
 ### Example Command
 The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
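Reviewer note: the hunk above drops the "always 1 for now" restriction on `TP_SIZE`, so the relation `EP_SIZE = TP_SIZE × DP_SIZE` now covers more than one layout. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and is not part of vLLM, which computes `EP_SIZE` automatically.

```python
def expert_parallel_size(tp_size: int, dp_size: int) -> int:
    """Return EP_SIZE = TP_SIZE * DP_SIZE, per the formula in the docs.

    Hypothetical helper for illustration only; vLLM derives this value itself.
    """
    if tp_size < 1 or dp_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tp_size * dp_size


# The documented example: 1-way tensor parallel, 8-way data parallel.
ep_example = expert_parallel_size(tp_size=1, dp_size=8)

# A layout with TP_SIZE > 1 on the same 8 GPUs (illustrative values).
ep_mixed = expert_parallel_size(tp_size=2, dp_size=4)

print(ep_example, ep_mixed)
```

Both layouts give an expert parallel size of 8. They differ only in how the attention layers are handled: replicated across data-parallel ranks, or additionally split by tensor parallelism.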
@@ -119,9 +121,6 @@ While MoE models are typically trained so that each expert receives a similar nu
 Enable EPLB with the `--enable-eplb` flag.
-!!! note "Model Support"
-    Currently only DeepSeek V3 architecture is supported.
 When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
 ### EPLB Parameters
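Reviewer note: the parameter table describes the `log_balancedness` metric as average tokens per expert divided by maximum tokens per expert. A minimal Python sketch of that ratio follows. It assumes the per-expert token counts are available as a plain list. The function name and the zero-load convention are assumptions, not vLLM's implementation.

```python
def balancedness(tokens_per_expert: list[int]) -> float:
    """Average tokens per expert divided by the maximum tokens per expert.

    1.0 means perfectly even load; values near 0 mean a few experts
    receive most of the tokens.
    """
    if not tokens_per_expert:
        raise ValueError("need at least one expert")
    peak = max(tokens_per_expert)
    if peak == 0:
        # Assumption: with no tokens routed at all, treat the load as balanced.
        return 1.0
    mean = sum(tokens_per_expert) / len(tokens_per_expert)
    return mean / peak


even = balancedness([100, 100, 100, 100])   # every expert equally loaded
skewed = balancedness([400, 0, 0, 0])       # one expert takes everything
print(even, skewed)
```

With four experts, the even case yields 1.0 and the fully skewed case yields 0.25, which is the lowest value four experts can produce.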
@@ -134,6 +133,8 @@ Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. T
 | `step_interval`| Frequency of rebalancing (every N engine steps) | 3000 |
 | `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
 | `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
+| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
+| `policy` | The policy type for expert parallel load balancing | `"default"` |
 For example:
......
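Reviewer note: the docs' own example is collapsed in this view, so the following is only a sketch of how the JSON string for `--eplb-config` could be assembled. It uses the keys visible in the table rows above, including the newly added `use_async` and `policy`. The values are the documented defaults, and the helper name is hypothetical. Other keys may exist in the part of the table not shown here.

```python
import json
import shlex

# Keys and defaults copied from the table rows visible in this diff.
eplb_config = {
    "step_interval": 3000,
    "log_balancedness": False,
    "num_redundant_experts": 0,
    "use_async": False,
    "policy": "default",
}


def eplb_cli_args(config: dict) -> list[str]:
    """Build the two EPLB-related arguments named in the docs.

    Hypothetical helper: it only serializes the dict to a JSON string.
    """
    return ["--enable-eplb", "--eplb-config", json.dumps(config)]


args = eplb_cli_args(eplb_config)

# shlex.join quotes the JSON so it survives a POSIX shell as one argument.
print(shlex.join(args))
```

Serializing with `json.dumps` avoids hand-written quoting mistakes, such as Python's `False` where JSON expects `false`.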