When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
### Example Command
### Example Command
The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
...
@@ -119,9 +121,6 @@ While MoE models are typically trained so that each expert receives a similar nu
...
@@ -119,9 +121,6 @@ While MoE models are typically trained so that each expert receives a similar nu
Enable EPLB with the `--enable-eplb` flag.
Enable EPLB with the `--enable-eplb` flag.
!!! note "Model Support"
Currently only DeepSeek V3 architecture is supported.
When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
### EPLB Parameters
### EPLB Parameters
...
@@ -134,6 +133,8 @@ Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. T
...
@@ -134,6 +133,8 @@ Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. T
| `step_interval`| Frequency of rebalancing (every N engine steps) | 3000 |
| `step_interval`| Frequency of rebalancing (every N engine steps) | 3000 |
| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
| `policy` | The policy type for expert parallel load balancing | `"default"` |