docs: disagg router docs update (#4093)

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Commit 17af7a0f (unverified), authored Nov 04, 2025 by Yan Ru Pei, committed by GitHub Nov 05, 2025

Showing with 16 additions and 15 deletions

benchmarks/router/README.md +13 -9

docs/router/kv_cache_routing.md +3 -6

--- a/benchmarks/router/README.md
+++ b/benchmarks/router/README.md
@@ -118,23 +118,27 @@ python -m dynamo.frontend --help

 For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).

+> [!Note]
+> If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
+>
+> ```bash
+> python -m dynamo.frontend \
+>     --router-mode kv \
+>     --http-port 8000 \
+>     --no-kv-events
+> ```
+
 #### Disaggregated Serving with Automatic Prefill Routing

 When you launch prefill workers using `run_engines.sh --prefill`, the frontend automatically detects them and activates an internal prefill router. This prefill router:
 - Automatically routes initial token processing to dedicated prefill workers
- Uses KV-aware routing regardless of the frontend's `--router-mode` setting
+- Uses the same routing mode as the frontend's `--router-mode` setting
 - Seamlessly integrates with your decode workers for token generation

 No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.

-**Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
-
-```bash
-python -m dynamo.frontend \
-    --router-mode kv \
-    --http-port 8000 \
-    --no-kv-events
-```
+> [!Note]
+> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)

 ### Step 3: Verify Setup


--- a/docs/router/kv_cache_routing.md
+++ b/docs/router/kv_cache_routing.md
@@ -65,13 +65,14 @@ The prefill router is automatically created when:
 2. A prefill worker is detected with the same model name and `ModelType.Prefill`

 **Key characteristics of the prefill router:**
- **Always uses KV-aware routing** regardless of the frontend's `--router-mode` setting
 - **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
 - **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
 - **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available

 ### Setup Example

+When both workers are registered, requests are automatically routed.
+
 ```python
 # Decode worker registration (in your decode worker)
 await register_llm(
@@ -92,12 +93,8 @@ await register_llm(
 )
 ```

-When both workers are registered, requests are automatically routed:
-1. **Prefill phase** → Prefill router selects best prefill worker (KV-aware)
-2. **Decode phase** → Decode router selects decode worker (uses frontend's `--router-mode`)
-
 > [!Note]
-> **WIP**: Currently, the prefill router always uses KV routing. Future updates will provide more fine-grained control over prefill routing behavior to match user-specified frontend router modes.
+> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).

 ## Overview