For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md).
#### Launching a Standalone Router for Prefill Workers (Optional)
#### Disaggregated Serving with Automatic Prefill Routing
If you're using disaggregated serving with separate prefill and decode workers, you should also launch a standalone router for prefill workers. This router handles routing prefill requests to dedicated prefill workers. When using a standalone prefill router, it's recommended to start the frontend (decode router) with `--kv-overlap-score-weight 0` for pure load balancing (as prefix-aware routing is now handled by the standalone router):
When you launch prefill workers using `run_engines.sh --prefill`, the frontend automatically detects them and activates an internal prefill router. This prefill router:
- Automatically routes initial token processing to dedicated prefill workers
- Uses KV-aware routing regardless of the frontend's `--router-mode` setting
- Seamlessly integrates with your decode workers for token generation
```bash
# Start the decode router with pure load balancing
python -m dynamo.frontend \
    --router-mode kv \
    --router-reset-states \
    --http-port 8000 \
    --kv-overlap-score-weight 0

# In another terminal, start the standalone router for prefill workers
python -m dynamo.router \
    --endpoint dynamo.prefill.generate \
    --block-size 64 \
    --router-reset-states \
    --no-track-active-blocks
```
The `--router-reset-states` flag clears any previous state, and `--no-track-active-blocks` disables active block tracking (suitable for prefill-only routing where decode load is not relevant).
No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/architecture/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
**Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [KV Cache Routing documentation](/docs/architecture/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for the default setup.
>
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
See [`components/backends/vllm/launch/disagg_router.sh`](/components/backends/vllm/launch/disagg_router.sh) for a complete example.
@@ -54,6 +54,51 @@ The main KV-aware routing arguments:
For basic model registration without KV routing, you can use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
## Disaggregated Serving (Prefill and Decode)
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md#model-types)), the frontend automatically detects them and activates an internal prefill router.
### Automatic Prefill Router Activation
The prefill router is automatically created when:
1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`)
2. A prefill worker is detected with the same model name and `ModelType.Prefill`
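
The two activation conditions can be read as a simple predicate over the set of registered workers. The sketch below is illustrative only: `ModelType` here is a local stand-in enum, and `Registration` and `prefill_router_active` are hypothetical names, not part of Dynamo's API.

```python
from dataclasses import dataclass
from enum import Flag, auto


class ModelType(Flag):
    """Local stand-in for the model type flags (not imported from Dynamo)."""
    Chat = auto()
    Completions = auto()
    Prefill = auto()


@dataclass(frozen=True)
class Registration:
    """Hypothetical record of one registered worker."""
    model_name: str
    model_type: ModelType


def prefill_router_active(registrations, model_name):
    """Return True when both activation conditions hold for model_name."""
    same_model = [r for r in registrations if r.model_name == model_name]
    # Condition 1: a decode model (Chat and/or Completions) is registered.
    has_decode = any(
        r.model_type & (ModelType.Chat | ModelType.Completions)
        for r in same_model
    )
    # Condition 2: a prefill worker exists under the same model name.
    has_prefill = any(r.model_type & ModelType.Prefill for r in same_model)
    return has_decode and has_prefill
```

Under this reading, a prefill worker registered under a different model name does not activate the prefill router for the decode model.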
**Key characteristics of the prefill router:**
- **Always uses KV-aware routing** regardless of the frontend's `--router-mode` setting
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
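
The fallback behavior can be sketched as follows. This is a minimal illustration under stated assumptions, not Dynamo's implementation: `serve`, `shared_prefix_len`, the worker dictionaries, and the `run_prefill` / `run_decode` callables are all hypothetical, and scoring workers by shared block prefix is a simplified stand-in for the router's actual KV-aware cost function.

```python
def shared_prefix_len(request_blocks, cached_blocks):
    """Count leading blocks that the request shares with a worker's cache."""
    n = 0
    for a, b in zip(request_blocks, cached_blocks):
        if a != b:
            break
        n += 1
    return n


def serve(request_blocks, prefill_workers, run_prefill, run_decode):
    """Run prefill when possible, then decode; fall back to decode-only."""
    prefill_result = None
    if prefill_workers:
        # KV-aware choice (assumed scoring): prefer the largest cache overlap.
        worker = max(
            prefill_workers,
            key=lambda w: shared_prefix_len(request_blocks, w["cached_blocks"]),
        )
        try:
            prefill_result = run_prefill(worker["name"], request_blocks)
        except Exception:
            # Prefill failed: continue in decode-only mode.
            prefill_result = None
    # With no prefill result, the decode worker handles the whole request.
    return run_decode(request_blocks, prefill_result)
```

In this sketch, a request still completes when the prefill pool is empty or the selected prefill worker raises, because decode runs either way.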
### Setup Example
```python
# Decode worker registration (in your decode worker)
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=generate_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)

# Prefill worker registration (in your prefill worker)
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,  # <-- Mark as prefill worker
    endpoint=generate_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",  # Must match decode model name
    # ... other parameters
)
```
When both workers are registered, requests are automatically routed:
1. **Prefill phase** → Prefill router selects best prefill worker (KV-aware)
> **WIP**: Currently, the prefill router always uses KV routing. Future updates will provide more fine-grained control over prefill routing behavior to match user-specified frontend router modes.
## Overview
The KV-aware router operates on two key principles to optimize request routing: