Unverified commit 17af7a0f, authored by Yan Ru Pei, committed by GitHub

docs: disagg router docs update (#4093)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent defe5de7
@@ -118,23 +118,27 @@ python -m dynamo.frontend --help
For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).
> [!Note]
> If you're unsure whether your backend engines correctly emit KV events for a given model (e.g., hybrid models like gpt-oss or Nemotron Nano 2), pass the `--no-kv-events` flag to disable KV event tracking and fall back to approximate KV indexing:
>
> ```bash
> python -m dynamo.frontend \
> --router-mode kv \
> --http-port 8000 \
> --no-kv-events
> ```
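The note above does not say how approximate KV indexing works. One plausible reading, assumed here and not confirmed by this diff, is that the router infers cache contents from its own routing decisions and lets each entry expire after a time-to-live. The sketch below illustrates that idea only; `ApproxIndex`, its methods, and the TTL value are hypothetical names and are not part of the Dynamo API.

```python
import time


class ApproxIndex:
    """Hypothetical sketch: guess cached blocks from past routing decisions."""

    def __init__(self, ttl_seconds=120.0, clock=time.monotonic):
        self.ttl = ttl_seconds          # assumed expiry window, not a documented default
        self.clock = clock              # injectable clock so the sketch is testable
        self._expiry = {}               # (worker, block_hash) -> expiry timestamp

    def record_routing(self, worker, block_hashes):
        # Assume every block of a routed request ends up cached on that worker.
        expires_at = self.clock() + self.ttl
        for block_hash in block_hashes:
            self._expiry[(worker, block_hash)] = expires_at

    def overlap(self, worker, block_hashes):
        # Count the leading blocks still believed to be cached on the worker.
        now = self.clock()
        matched = 0
        for block_hash in block_hashes:
            if self._expiry.get((worker, block_hash), float("-inf")) > now:
                matched += 1
            else:
                break  # a prefix match ends at the first missing block
        return matched
```

A router built this way never needs events from the engine, at the cost of occasionally mispredicting what a worker still holds.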
#### Disaggregated Serving with Automatic Prefill Routing
When you launch prefill workers using `run_engines.sh --prefill`, the frontend automatically detects them and activates an internal prefill router. This prefill router:
- Automatically routes initial token processing to dedicated prefill workers
- Uses KV-aware routing regardless of the frontend's `--router-mode` setting
- Uses the same routing mode as the frontend's `--router-mode` setting
- Seamlessly integrates with your decode workers for token generation
No additional configuration is needed: launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
**Note**: If you're unsure whether your backend engines correctly emit KV events for a given model (e.g., hybrid models like gpt-oss or Nemotron Nano 2), pass the `--no-kv-events` flag to disable KV event tracking and fall back to approximate KV indexing:
```bash
python -m dynamo.frontend \
--router-mode kv \
--http-port 8000 \
--no-kv-events
```
> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for the vLLM and TensorRT-LLM backends. For SGLang (work in progress), launch a separate standalone router as the prefill router, targeting the prefill endpoints. See the example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
### Step 3: Verify Setup
......
@@ -65,13 +65,14 @@ The prefill router is automatically created when:
2. A prefill worker is detected with the same model name and `ModelType.Prefill`
**Key characteristics of the prefill router:**
- **Always uses KV-aware routing** regardless of the frontend's `--router-mode` setting
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
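The pipeline position and fallback behavior listed above can be pictured with a small sketch. This is an illustration under stated assumptions, not Dynamo's implementation: `Worker`, `route_request`, and the "pick the first worker" selection are hypothetical stand-ins for the real routers.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker that either handles a phase or raises."""
    name: str
    healthy: bool = True

    def run(self, phase, prompt):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is unavailable")
        return f"{phase}:{self.name}"


def route_request(prompt, prefill_workers, decode_workers):
    """Return the phases that actually ran, in order."""
    trace = []
    if prefill_workers:
        try:
            # Prefill sits between preprocessing and decode routing.
            trace.append(prefill_workers[0].run("prefill", prompt))
        except RuntimeError:
            pass  # prefill failed: continue in decode-only mode
    # With no prefill workers (or a failed prefill), decode handles everything.
    trace.append(decode_workers[0].run("decode", prompt))
    return trace
```

For example, `route_request("hi", [], [Worker("d0")])` skips the prefill phase entirely and runs only the decode step.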
### Setup Example
When both workers are registered, requests are automatically routed.
```python
# Decode worker registration (in your decode worker)
await register_llm(
@@ -92,12 +93,8 @@ await register_llm(
)
```
When both workers are registered, requests are automatically routed:
1. **Prefill phase** → Prefill router selects best prefill worker (KV-aware)
2. **Decode phase** → Decode router selects decode worker (uses frontend's `--router-mode`)
> [!Note]
> **WIP**: Currently, the prefill router always uses KV routing. Future updates will provide more fine-grained control over prefill routing behavior to match user-specified frontend router modes.
> The unified frontend with automatic prefill routing is currently enabled for the vLLM and TensorRT-LLM backends. For SGLang (work in progress), launch a separate standalone router as the prefill router, targeting the prefill endpoints. See the example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
## Overview
......