Commit 49087845 authored by Yan Ru Pei, committed by GitHub

docs(router): split router docs into focused pages (#8122)


Signed-off-by: PeaBrane <yanrpei@gmail.com>
parent e48de6aa
......@@ -3,7 +3,7 @@
# Standalone Router
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router-guide.md).
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see [Routing Concepts](/docs/components/router/router-concepts.md).
## Overview
......@@ -29,7 +29,7 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
**Router Configuration:**
All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`, `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see the [Router Guide](/docs/components/router/router-guide.md).
All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`, `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see [Configuration and Tuning](/docs/components/router/router-configuration.md).
## Architecture
......@@ -43,7 +43,7 @@ Clients call the `generate` endpoint to stream completions, or call `best_worker
## Example: Manual Disaggregated Serving (Alternative Setup)
> [!Note]
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router-guide.md#disaggregated-serving) for the default setup.
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See [Disaggregated Serving](/docs/components/router/router-disaggregated-serving.md) for the default setup.
>
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
......@@ -106,7 +106,9 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
## See Also
- [Router Guide](/docs/components/router/router-guide.md) - Configuration and tuning for KV-aware routing
- [Router Guide](/docs/components/router/router-guide.md) - Deployment modes and quick start
- [Configuration and Tuning](/docs/components/router/router-configuration.md) - CLI flags, transport modes, and metrics
- [Disaggregated Serving](/docs/components/router/router-disaggregated-serving.md) - Prefill and decode routing setups
- [Router Design](/docs/design-docs/router-design.md) - Architecture details and event transport modes
- [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
- [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
......@@ -112,5 +112,5 @@ for chunk in response:
## See Also
- **[NVIDIA Request Extensions (nvext)](../../components/frontend/nvext.md)**: Full `nvext` field reference including agent hints
- **[Router Guide](../../components/router/router-guide.md)**: Router configuration and CLI arguments
- **[Configuration and Tuning](../../components/router/router-configuration.md)**: Router configuration and CLI arguments
- **[SGLang HiCache](../../integrations/sglang-hicache.md)**: Enabling hierarchical KV cache
......@@ -140,4 +140,4 @@ SGLang workers expose operational endpoints via Dynamo's system server:
- **[Examples](sglang-examples.md)**: All deployment patterns
- **[Disaggregation](sglang-disaggregation.md)**: P/D architecture and KV transfer
- **[Diffusion](sglang-diffusion.md)**: LLM, image, and video diffusion models
- **[Router Guide](../../components/router/router-guide.md)**: KV-aware routing configuration
- **[Configuration and Tuning](../../components/router/router-configuration.md)**: KV-aware routing configuration
......@@ -32,7 +32,7 @@ docker compose -f deploy/docker-compose.yml up -d
Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start up the ingress and `python3 -m dynamo.trtllm <args>` to start up the workers.
</Tip>
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
For detailed information about KV-aware routing behavior, see [Routing Concepts](../../components/router/router-concepts.md). For deployment modes, see the [Router Guide](../../components/router/router-guide.md).
## Single Node Examples
......@@ -64,6 +64,5 @@ For configuration details, see the [FlexKV Integration Guide](../../integrations
## See Also
- **[KVBM Design](../../design-docs/kvbm-design.md)**: Architecture and design of Dynamo's built-in KV cache offloading
- **[KV-Aware Routing](../../components/router/router-guide.md)**: Routing requests based on KV cache state
- **[Routing Concepts](../../components/router/router-concepts.md)**: Routing requests based on KV cache state
- **[Disaggregated Serving](../../design-docs/disagg-serving.md)**: Prefill/decode separation architecture
......@@ -96,5 +96,5 @@ Dynamo supports [request migration](../../fault-tolerance/request-migration.md)
- **[Examples](vllm-examples.md)**: All deployment patterns with launch scripts
- **[vLLM README](README.md)**: Quick start and feature overview
- **[Observability](vllm-observability.md)**: Metrics and monitoring setup
- **[Router Guide](../../components/router/router-guide.md)**: KV-aware routing configuration
- **[Configuration and Tuning](../../components/router/router-configuration.md)**: KV-aware routing configuration
- **[Fault Tolerance](../../fault-tolerance/README.md)**: Request migration, cancellation, and graceful shutdown
......@@ -826,7 +826,7 @@ VllmPrefillWorker:
## Conclusion
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the [Tuning Guidelines](../components/router/router-guide.md#tuning-guidelines).
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see [Tuning Guidelines](../components/router/router-configuration.md#tuning-guidelines).
For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.
......@@ -179,6 +179,6 @@ All endpoint paths can be overridden via environment variables:
- [Frontend Overview](README.md) — quick start and feature matrix
- [Frontend Guide](frontend-guide.md) — KServe gRPC configuration
- [NVIDIA Request Extensions (nvext)](nvext.md) — custom request fields
- [Router Guide](../router/router-guide.md) — detailed routing configuration
- [Configuration and Tuning](../router/router-configuration.md) — detailed routing configuration
- [Metrics](../../observability/metrics.md) — available Prometheus metrics
- [Fault Tolerance](../../fault-tolerance/README.md) — request migration and rejection
......@@ -163,5 +163,5 @@ When the client requests response metadata via `extra_fields`, the response incl
| Document | Description |
|----------|-------------|
| [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
| [Router Guide](../router/router-guide.md) | Full router configuration and CLI arguments |
| [Configuration and Tuning](../router/router-configuration.md) | Full router configuration and CLI arguments |
| [SGLang for Agentic Workloads](../../backends/sglang/agents.md) | SGLang engine flags for priority scheduling and eviction policies |
......@@ -29,7 +29,7 @@ For Kubernetes, set `DYN_ROUTER_MODE=kv` on the Frontend service. Workers automa
You can also run the KV router as a standalone service (without the Dynamo frontend). See the [Standalone Router component](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/router/) for more details.
For all CLI arguments, environment variables, K8s deployment examples, and tuning guidelines, see the [Router Guide](router-guide.md). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For deployment modes and quick start steps, see the [Router Guide](router-guide.md). For CLI arguments and tuning guidelines, see [Configuration and Tuning](router-configuration.md). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
## Prerequisites and Limitations
......@@ -50,7 +50,11 @@ For basic model registration without KV routing, use `--router-mode round-robin`
## Next Steps
- **[Router Guide](router-guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Guide](router-guide.md)**: Deployment modes, quick start, and page map
- **[Routing Concepts](router-concepts.md)**: Cost model and worker-selection behavior
- **[Configuration and Tuning](router-configuration.md)**: Router flags, transport modes, and metrics
- **[Disaggregated Serving](router-disaggregated-serving.md)**: Prefill and decode routing setups
- **[Router Operations](router-operations.md)**: Replicas, persistence, and recovery
- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Standalone Indexer](standalone-indexer.md)**: Run the KV indexer as a separate service for independent scaling
- **[Router Design](../../design-docs/router-design.md)**: Architecture details, algorithms, and event transport modes
......@@ -106,5 +106,6 @@ For deployments using Dynamo's KV-aware routing, the local indexer is used autom
## See Also
- **[KV Router Index Data Structures](https://github.com/ai-dynamo/dynamo/blob/main/lib/kv-router/src/indexer/README.md)**: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` internals
- **[Router Guide](router-guide.md)**: Configuration, deployment, and tuning for KV-aware routing
- **[Router Guide](router-guide.md)**: Deployment modes and quick start for KV-aware routing
- **[Configuration and Tuning](router-configuration.md)**: Router flags and tuning details
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Routing Concepts
subtitle: Cost model, worker selection, and routing primitives for the Dynamo router
---
This page explains how the Dynamo router evaluates workers, chooses a target, and fits into the request path. For CLI flags and tuning knobs, see [Configuration and Tuning](router-configuration.md).
## KV Cache Routing
KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.
## Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
Lower costs indicate better routing choices.
`overlap_score_weight` balances cache hit optimization against load distribution.
Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL).
## Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
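The calculation above can be reproduced with a short Python sketch. The cost formula and the block counts come from this page; the function names, the min-max normalization, and the sign convention of the softmax logits are illustrative assumptions, not the router's actual implementation.

```python
import math
import random


def worker_cost(prefill_blocks, decode_blocks, overlap_score_weight=1.0):
    """cost = overlap_score_weight * prefill_blocks + decode_blocks"""
    return overlap_score_weight * prefill_blocks + decode_blocks


def select_worker(costs, temperature=0.0, rng=random):
    """Pick the lowest-cost worker, or sample via softmax when temperature > 0."""
    if temperature == 0.0:
        return min(costs, key=costs.get)
    # Assumption: costs are min-max normalized and negated so that a lower
    # cost becomes a higher logit before dividing by the temperature.
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0
    logits = {w: -((c - lo) / span) / temperature for w, c in costs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


# (prefill_blocks, decode_blocks) per worker, taken from the diagram above
workers = {"worker-1": (8, 10), "worker-2": (5, 5), "worker-3": (2, 9)}
costs = {name: worker_cost(p, d) for name, (p, d) in workers.items()}
best = select_worker(costs)  # deterministic because temperature is 0
```

With a non-zero temperature, for example `select_worker(costs, temperature=0.5)`, higher-cost workers are occasionally chosen, which spreads load at the expense of some cache reuse.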
## Using the KV Cache Router
To enable KV cache-aware routing, start the frontend node like this:
```bash
python -m dynamo.frontend --router-mode kv
```
When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
For detailed CLI arguments and advanced configuration options, see [Configuration and Tuning](router-configuration.md).
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, create a client tied to a component endpoint. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.
```python
client = runtime.endpoint("dynamo.VllmWorker.generate").client()
```
You can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
- **Least-loaded routing**: Routes to the worker with fewest active connections via `--router-mode least-loaded`
- **Device-aware weighted routing**: Routes using CPU/non-CPU ratio budgeting plus least-loaded selection within the selected device group via `--router-mode device-aware-weighted`. In disaggregated prefill paths, this mode skips bootstrap optimization and uses the synchronous prefill path, matching power-of-two routing.
KV cache routing uses direct routing with a special worker selection algorithm.
For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For custom routing logic and advanced patterns, see [Routing Patterns](router-examples.md#routing-patterns).
## Device-Aware Weighted Routing
`device-aware-weighted` is designed for heterogeneous fleets where CPU and non-CPU workers share the same endpoint. Instead of comparing raw in-flight counts, the router compares a capability-normalized load across the CPU and non-CPU groups, then selects the least-loaded worker within the winning group.
```text
normalized_load = total_inflight(group) / (instance_count(group) x throughput_weight)
```
The throughput weight is `1` for CPU workers and `DYN_ENCODER_CUDA_TO_CPU_RATIO` for non-CPU workers. This lets the router route proportionally to device capability instead of permanently starving slower devices.
When only one device class is present, the behavior degenerates to standard least-loaded routing.
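As a rough illustration of the group comparison described above, the following Python sketch computes the normalized load per device group and then picks the least-loaded worker in the winning group. The function names, the dictionary layout, and the tie-breaking behavior are assumptions made for the example; only the formula and the default ratio of `8` come from the documentation.

```python
def normalized_load(inflight, throughput_weight):
    """total_inflight(group) / (instance_count(group) * throughput_weight)"""
    return sum(inflight.values()) / (len(inflight) * throughput_weight)


def pick_worker(cpu, non_cpu, cuda_to_cpu_ratio=8.0):
    """Select a worker id; cpu and non_cpu map worker id -> in-flight requests.

    cuda_to_cpu_ratio stands in for DYN_ENCODER_CUDA_TO_CPU_RATIO.
    """
    groups = []
    if cpu:
        groups.append((normalized_load(cpu, 1.0), cpu))
    if non_cpu:
        groups.append((normalized_load(non_cpu, cuda_to_cpu_ratio), non_cpu))
    if not groups:
        raise ValueError("no workers available")
    # Assumption: ties go to the first group and first worker encountered.
    _, winner = min(groups, key=lambda group: group[0])
    return min(winner, key=winner.get)


cpu_workers = {"cpu-0": 2, "cpu-1": 1}        # normalized load 3 / (2 * 1) = 1.5
gpu_workers = {"gpu-0": 12, "gpu-1": 10}      # normalized load 22 / (2 * 8) = 1.375
chosen = pick_worker(cpu_workers, gpu_workers)
```

Here the non-CPU group wins despite carrying far more raw in-flight requests, and `gpu-1` is picked as its least-loaded member. Passing an empty dictionary for one group reduces the function to plain least-loaded selection.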
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Configuration and Tuning
subtitle: Router flags, event transport, load tracking, and tuning guidance
---
This page collects the main router flags for frontend-embedded and standalone deployments. For the routing cost model and worker-selection behavior, see [Routing Concepts](router-concepts.md).
## Routing Behavior
- `--router-kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--router-track-prefill-tokens`: Enables prompt-side load accounting in the worker cost model. This should stay enabled if you want queue thresholds, `active_prefill_tokens`, and AIC prefill load decay to reflect prompt work.
- `--router-prefill-load-model`: Selects the router's prompt-side load model. `none` keeps the existing static prompt load accounting. `aic` predicts one expected prefill duration per admitted request and lazily decays only the oldest active prefill request on each worker.
- `--router-queue-threshold`: Queue threshold fraction for prefill token capacity (default: 4.0). The router holds incoming requests in a priority queue while all workers exceed this fraction of `max_num_batched_tokens`, releasing them when capacity frees up. This defers dispatch rather than rejecting work, so routing decisions use the freshest load metrics at the moment a request is actually sent to a worker. It also enables priority scheduling via `priority` hints in `nvext.agent_hints`. Must be greater than 0. Set to `None` to disable queueing.
- `--router-queue-policy`: Scheduling policy for the router queue (default: `fcfs`).
  - `fcfs` orders by adjusted arrival time (`priority_jump - arrival_offset`) and optimizes tail TTFT.
  - `lcfs` orders by adjusted reverse arrival time (`priority_jump + arrival_offset`) and mainly serves controlled comparison experiments.
  - `wspt` orders by `(1 + priority_jump) / isl_tokens` and optimizes average TTFT.
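To make the three queue policies concrete, here is a minimal Python sketch of the ordering keys. The key expressions are the ones listed above; the assumption that larger keys are dequeued first, the units of `priority_jump` and `arrival_offset`, and the sample requests are all illustrative.

```python
def queue_key(policy, priority_jump, arrival_offset, isl_tokens):
    """Ordering key for a queued request (assumption: larger key is served first)."""
    if policy == "fcfs":
        return priority_jump - arrival_offset
    if policy == "lcfs":
        return priority_jump + arrival_offset
    if policy == "wspt":
        return (1.0 + priority_jump) / isl_tokens
    raise ValueError(f"unknown queue policy: {policy}")


# Hypothetical requests: name -> (priority_jump, arrival_offset, isl_tokens)
requests = {
    "a": (0.0, 0.0, 1000),   # arrived first, no priority
    "b": (0.0, 1.0, 200),    # short prompt, arrived second
    "c": (5.0, 2.0, 4000),   # long prompt with a priority hint, arrived last
}


def dispatch_order(policy):
    """Request names sorted from first-served to last-served."""
    return sorted(
        requests,
        key=lambda name: queue_key(policy, *requests[name]),
        reverse=True,
    )
```

Under these assumptions `fcfs` serves `c, a, b`, `lcfs` serves `c, b, a`, and `wspt` serves `b, c, a`: the priority hint dominates the arrival-based policies, while `wspt` lets the short prompt jump ahead.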
For `--router-mode device-aware-weighted`, set `DYN_ENCODER_CUDA_TO_CPU_RATIO` to the approximate throughput ratio of one non-CPU worker relative to one CPU worker. The default is `8`.
## KV Event Transport and Persistence
- `--no-router-kv-events`: Disables KV event tracking. By default, the router uses KV events to monitor block creation and deletion from workers. When disabled, the router predicts cache state from routing decisions with TTL-based expiration and pruning.
- `--router-durable-kv-events`: **Deprecated.** Enables JetStream mode for KV event transport. The event-plane subscriber in local indexer mode is now the recommended path.
- `--router-reset-states`: Only applies in JetStream mode (`--router-durable-kv-events`). Resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting from a fresh state.
- `--router-snapshot-threshold`: Only applies in JetStream mode (`--router-durable-kv-events`). Sets the number of messages in JetStream before triggering a snapshot.
## Block Tracking
- `--no-router-track-active-blocks`: Disables tracking of active blocks used for ongoing generation or decode phases. Disable this when routing to workers that only perform prefill.
- `--router-track-output-blocks`: **Experimental.** Enables tracking of output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward the expected output sequence length (`agent_hints.osl` in `nvext`).
- `--no-router-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. This is useful in disaggregated setups where transferred blocks are not actually deduplicated on the decode side.
- `--no-router-track-prefill-tokens`: Disables prompt-side prefill token accounting in the router's active load model. Use this for decode-only routing paths where prompt processing already happened elsewhere.
- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas.
## KV Indexer / Approx KV Indexer
- `--router-ttl-secs`: Time-to-live in seconds for blocks in the router's local cache predictions. Defaults to 120.0 seconds when `--no-router-kv-events` is used.
- `--router-max-tree-size`: Maximum tree size before pruning is triggered. Defaults to 1048576 (2^20 blocks) when `--no-router-kv-events` is used.
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. Defaults to 0.8 when `--no-router-kv-events` is used.
- `--router-event-threads`: Number of event processing threads for the KV indexer (default: 4). With KV events enabled, values greater than 1 use the concurrent radix tree; approximate mode always uses a single-threaded indexer.
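The interaction between the TTL, the maximum size, and the prune target can be sketched in Python as follows. This is a deliberately simplified model: it uses a flat dictionary instead of a radix tree, evicts the least recently recorded blocks first, and the class and method names are hypothetical. Only the three default values are taken from the flag descriptions above.

```python
import time


class ApproxBlockCache:
    """Toy model of predicted cache state with TTL expiry and size-based pruning."""

    def __init__(self, ttl_secs=120.0, max_tree_size=1_048_576,
                 prune_target_ratio=0.8, clock=time.monotonic):
        self.ttl_secs = ttl_secs
        self.max_tree_size = max_tree_size
        self.prune_target_ratio = prune_target_ratio
        self.clock = clock
        self._expiry = {}  # block hash -> expiry time, oldest entry first

    def __len__(self):
        return len(self._expiry)

    def record(self, block_hash):
        """Predict that a routed request left this block cached on a worker."""
        self._expiry.pop(block_hash, None)  # re-inserting moves it to the end
        self._expiry[block_hash] = self.clock() + self.ttl_secs
        if len(self._expiry) > self.max_tree_size:
            self._prune()

    def _prune(self):
        # Assumption: evict oldest entries until the target size is reached.
        target = int(self.max_tree_size * self.prune_target_ratio)
        while len(self._expiry) > target:
            del self._expiry[next(iter(self._expiry))]

    def contains(self, block_hash):
        """True if the block is predicted to be cached and has not expired."""
        expiry = self._expiry.get(block_hash)
        if expiry is None:
            return False
        if expiry <= self.clock():
            del self._expiry[block_hash]
            return False
        return True
```

A larger `ttl_secs` keeps predictions alive longer but risks routing toward blocks the engine has already evicted; a smaller `prune_target_ratio` prunes more aggressively each time the size limit is hit.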
To implement KV event publishing for custom inference engines, see [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md).
For details on per-request agent hints (`priority`, `osl`, `speculative_prefill`), see [NVIDIA Request Extensions (`nvext`)](../frontend/nvext.md#agent-hints).
## Tuning Guidelines
`--router-kv-overlap-score-weight` is the primary knob for balancing prefill efficiency against decode load. Prefill-heavy workloads benefit from a higher weight, which steers requests toward workers with better cache overlap and reduces TTFT. Decode-heavy workloads benefit from a lower weight, which distributes decode load more evenly and reduces ITL. The default of 1.0 is a reasonable starting point. This weight can also be overridden per request via `nvext.agent_hints.kv_overlap_score_weight`.
Use `--no-router-kv-events` when you are not confident that your backend engine emits KV events correctly. In this mode the router falls back to approximate routing, predicting cache state from its own routing decisions with TTL-based expiration and pruning.
Use `--no-router-assume-kv-reuse` in disaggregated setups where the decode worker does not reuse transferred KV cache blocks. Without this flag, the router undercounts decode blocks when duplicates exist, leading to inaccurate load estimates.
Use `--no-router-track-prefill-tokens` when a router is serving decode-only traffic and prompt processing has already completed elsewhere. This keeps decode routing decisions focused on decode-side load instead of briefly charging prompt tokens to the decode worker after handoff.
Use `--router-track-output-blocks` when your workload is output-heavy and you want the router to account for output-side KV cache growth in load balancing. If you also pass `nvext.agent_hints.osl` per request, the router applies fractional decay to output blocks so that requests nearing completion contribute less future load.
`--router-queue-threshold` controls when incoming requests are held in a priority queue. The router waits while all workers exceed the configured fraction of `max_num_batched_tokens`, then releases work as capacity frees up. Set it to `None` to disable queueing entirely.
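The hold condition can be expressed as a small predicate. The sketch below assumes per-worker `active_prefill_tokens` counts are available as a dictionary and that all workers share one `max_num_batched_tokens` value; the function name and signature are hypothetical.

```python
def should_queue(active_prefill_tokens, max_num_batched_tokens, queue_threshold=4.0):
    """Return True while every worker exceeds threshold * max_num_batched_tokens.

    active_prefill_tokens maps worker id -> prompt tokens currently charged to it.
    A threshold of None disables queueing, so requests are dispatched immediately.
    With no workers known, this sketch returns True because nothing can accept work.
    """
    if queue_threshold is None:
        return False
    limit = queue_threshold * max_num_batched_tokens
    return all(tokens > limit for tokens in active_prefill_tokens.values())


# With max_num_batched_tokens = 8192 and the default threshold, the limit is 32768.
saturated = should_queue({"w1": 40_000, "w2": 33_000}, 8192)   # both above the limit
has_room = should_queue({"w1": 40_000, "w2": 1_000}, 8192)     # w2 can take work
```

Because the request is only held while all workers are over the limit, a single worker with spare capacity is enough to release it.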
Use `--router-prefill-load-model aic` when you want prompt-side load tracking to decay the oldest active prefill request using an AIC-predicted duration instead of keeping prompt load static until first token. This requires `--router-track-prefill-tokens` and the shared `--aic-*` config.
Use `--router-queue-policy wspt` when your workload has a mix of short and long requests and you want to minimize average TTFT. Use the default `fcfs` when you want to minimize tail TTFT.
## Prometheus Metrics
The router exposes Prometheus metrics on the frontend's HTTP port (default 8000) at `/metrics`:
- **Router request metrics** (`dynamo_component_router_*`): Registered via the component's metrics hierarchy and exposed on the frontend via the `drt_metrics` bridge. In KV mode they are populated per request; in non-KV modes they are registered with zero values. The standalone router also registers these metrics, available on `DYN_SYSTEM_PORT` when set.
- **Routing overhead metrics** (`dynamo_router_overhead_*`) and **per-worker gauges** (`dynamo_frontend_worker_*`): Registered on the frontend's own Prometheus registry. These are frontend-only and not available on the standalone router.
For the full list of router metrics, see the [Metrics reference](../../observability/metrics.md#router-metrics).
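A quick way to inspect these metrics is to scrape the endpoint and keep only the router-related families. The sketch below uses only the Python standard library; the base URL assumes a frontend on the default port on `localhost`, and the helper names are made up for this example.

```python
from urllib.request import urlopen

# Metric name prefixes listed in this section
ROUTER_PREFIXES = (
    "dynamo_component_router_",
    "dynamo_router_overhead_",
    "dynamo_frontend_worker_",
)


def filter_router_metrics(exposition_text, prefixes=ROUTER_PREFIXES):
    """Keep sample lines whose metric name starts with a router prefix.

    Comment lines (# HELP, # TYPE) start with '#', so they are dropped.
    """
    return [line for line in exposition_text.splitlines() if line.startswith(prefixes)]


def fetch_router_metrics(base_url="http://localhost:8000"):
    """Scrape /metrics from a running frontend (assumed reachable at base_url)."""
    with urlopen(f"{base_url}/metrics", timeout=5) as response:
        return filter_router_metrics(response.read().decode("utf-8"))
```

Calling `fetch_router_metrics()` against a standalone router would require pointing `base_url` at the port configured through `DYN_SYSTEM_PORT`, and would return only the `dynamo_component_router_*` family.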
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Disaggregated Serving
subtitle: Prefill and decode routing with the Dynamo router
---
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill`, the frontend automatically detects them and activates an internal prefill router.
For the high-level deployment matrix, see [Router Guide](router-guide.md). For the router flags used in this setup, see [Configuration and Tuning](router-configuration.md).
## Automatic Prefill Router Activation
The prefill router is automatically created when:
1. A decode model is registered, for example via `register_model()` with `ModelType.Chat | ModelType.Completions`.
2. A prefill worker is detected with the same model name and `ModelType.Prefill`.
Key characteristics of the prefill router:
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers do not perform decode.
- **Seamlessly integrates** into the request pipeline between preprocessing and decode routing.
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available.
Key characteristics of the decode routing stage in disaggregated mode:
- **Disables overlap scoring** (`overlap_score_weight=0`) because decode routing should not chase prefix reuse.
- **Disables KV reuse assumption** (`assume_kv_reuse=false`) unless the backend can truly deduplicate transferred blocks.
- **Disables prefill-token tracking** (`track_prefill_tokens=false`) so decode-side load reflects decode work rather than already-completed prompt work.
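The per-stage overrides listed above can be summarized as a transformation of a base configuration. The `RouterSettings` dataclass below is a hypothetical container used only for illustration, not a Dynamo class; its defaults mirror the documented flag defaults.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RouterSettings:
    """Hypothetical settings container (not part of the Dynamo API)."""
    overlap_score_weight: float = 1.0
    track_active_blocks: bool = True
    assume_kv_reuse: bool = True
    track_prefill_tokens: bool = True


def prefill_stage(base):
    """Prefill workers do not decode, so active block tracking is turned off."""
    return replace(base, track_active_blocks=False)


def decode_stage(base):
    """Decode routing ignores prefix overlap and already-completed prompt work."""
    return replace(
        base,
        overlap_score_weight=0.0,
        assume_kv_reuse=False,
        track_prefill_tokens=False,
    )


base = RouterSettings()
prefill_settings = prefill_stage(base)
decode_settings = decode_stage(base)
```

Keeping the two stages as derived copies of one base makes it easy to see which knobs differ between prefill and decode routing.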
## Setup Example
When both workers are registered, requests are automatically routed.
```python
# Decode worker registration (in your decode worker)
decode_endpoint = runtime.endpoint("dynamo.decode.generate")
await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)
await decode_endpoint.serve_endpoint(decode_handler.generate)

# Prefill worker registration (in your prefill worker)
prefill_endpoint = runtime.endpoint("dynamo.prefill.generate")
await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,
    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)
await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```
> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang, launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags such as `--router-block-size` and `--router-kv-events`. See the [Standalone Router README](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/router/README.md) and [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/launch/disagg_router.sh).
## Request Flow
The following diagram shows an overview of the major components in disaggregated serving:
```mermaid
graph TD
HTTP[HTTP]
ROUTER[Router]
PREFILL[Prefill Worker]
DECODE[Decode Worker]
classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333;
classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;
class PREFILL,DECODE worker_style
class ROUTER router_style
HTTP <--> |"request/response"| ROUTER
ROUTER --> |"1. send to prefill"| PREFILL
PREFILL --> |"2. return NIXL metadata"| ROUTER
ROUTER --> |"3. send with metadata"| DECODE
DECODE --> |"4. stream response"| ROUTER
PREFILL -.-> |"publish kv events"| ROUTER
linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px
linkStyle 5 stroke:#2196f3,stroke-width:2px
```
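The numbered steps in the diagram can be traced with a small `asyncio` sketch. Both workers are stubs written for this example: the metadata payload, the token values, and the function names are placeholders, and the fallback branch reflects the decode-only behavior described earlier on this page.

```python
import asyncio


async def prefill_worker(request):
    """Stub prefill worker: returns placeholder transfer metadata (steps 1-2)."""
    return {"kv_transfer": f"metadata-for-{request['id']}"}


async def decode_worker(request, metadata):
    """Stub decode worker: streams a fixed token sequence (step 4)."""
    for token in ("Hello", ",", " world"):
        yield token


async def route(request, prefill=prefill_worker, decode=decode_worker):
    """Send to prefill, forward the metadata to decode, and stream the response."""
    try:
        metadata = await prefill(request)
    except Exception:
        metadata = None  # fall back to decode-only mode if prefill fails
    async for token in decode(request, metadata):  # step 3, then step 4
        yield token


async def main():
    request = {"id": "req-1", "token_ids": [1, 2, 3]}
    return [token async for token in route(request)]


tokens = asyncio.run(main())
```

In a real deployment the KV events published by the prefill worker would also feed the router's indexer; that side channel is omitted here to keep the request path readable.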
......@@ -293,5 +293,5 @@ For deployments with multiple worker pools, the **Global Router** enables hierar
## See Also
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Guide](router-guide.md)**: Configuration, tuning, and production setup
- **[Configuration and Tuning](router-configuration.md)**: Router flags and production setup
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Router Operations
subtitle: Replica topology, remote indexers, state management, and recovery
---
This page covers day-2 operational topics for router deployments. For flags and tuning guidance, see [Configuration and Tuning](router-configuration.md).
## Serving Multiple Router Replicas
For improved fault tolerance, you can launch multiple frontend-plus-router replicas. If multiple `dynamo.frontend` processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone `python -m dynamo.router` service.
## Dynamo-Native Remote Indexer
For Dynamo-native deployments, the remote indexer is served by `dynamo.frontend` or `dynamo.router`, not by `dynamo.indexer`.
- Use `--serve-indexer` on router or frontend replicas that should expose `kv_indexer_query` from the worker component.
- Use `--use-remote-indexer` on consumer routers or frontends that should query that served endpoint instead of maintaining a local overlap indexer.
- `dynamo.indexer` remains the standalone HTTP plus ZMQ microservice for non-Dynamo or direct-ZMQ deployments.
Frontend example:
```bash
# Serving frontend (exposes the indexer endpoint)
python -m dynamo.frontend --router-mode kv --serve-indexer
# Consumer frontend
python -m dynamo.frontend --router-mode kv --use-remote-indexer
```
The served indexer endpoint is request-plane only. Each serving router or frontend keeps its normal local KV event ingestion, gap detection, and worker-query recovery path; remote consumers only issue hash-based overlap queries.
Approximate mode (`--no-router-kv-events`) is singleton-only for remote serving: only one `--serve-indexer` replica may exist for a given worker component. Event-driven mode allows multiple serving replicas behind the same worker component.
```mermaid
graph TD
subgraph "Workers"
W1["Worker 1"]
W2["Worker 2"]
end
subgraph "Event Plane"
EP["KV Events"]
end
subgraph "Serving Routers / Frontends"
S1["Router / Frontend A<br/>--serve-indexer"]
S2["Router / Frontend B<br/>--serve-indexer"]
I1["Local Indexer"]
I2["Local Indexer"]
end
subgraph "Request Plane"
RP["backend.kv_indexer_query"]
end
C["Consumer Router / Frontend<br/>--use-remote-indexer"]
W1 --> EP
W2 --> EP
EP --> S1
EP --> S2
S1 --> I1
S2 --> I2
C --> RP
RP --> S1
RP --> S2
```
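The singleton rule for approximate mode can be checked before rollout. The sketch below is not part of Dynamo: `Replica` and `singleton_violations` are hypothetical names, and it assumes that a worker component with any approximate-mode serving replica may have only one serving replica in total.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    worker_component: str
    serve_indexer: bool   # models --serve-indexer
    kv_events: bool       # False models --no-router-kv-events (approximate mode)


def singleton_violations(replicas):
    """Return worker components that break the approximate-mode singleton rule."""
    serving = defaultdict(list)
    for r in replicas:
        if r.serve_indexer:
            serving[r.worker_component].append(r)
    return sorted(
        component
        for component, group in serving.items()
        # Assumption: one approximate-mode server forbids any other server.
        if len(group) > 1 and any(not r.kv_events for r in group)
    )
```

Event-driven components pass the check with any number of serving replicas, which matches the diagram above.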
## Router State Management
The KV router tracks two types of state:
1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree that tracks which blocks are cached on each worker. This state is persistent. In local indexer mode, it is rebuilt from workers on startup. In JetStream mode (`--router-durable-kv-events`), it is backed by JetStream events and object store snapshots.
2. **Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is ephemeral. When a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests.
For the architecture behind these states, see [Router Design](../../design-docs/router-design.md).
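To make the two kinds of state concrete, here is a toy model. It uses an uncompressed prefix tree as a stand-in for the radix tree and a plain counter for active blocks; the class names `PrefixIndex` and `ActiveBlocks` are hypothetical and do not reflect the router's actual data structures.

```python
from collections import defaultdict


class PrefixIndex:
    """Toy prefix tree keyed by block hashes (persistent-style state)."""

    def __init__(self):
        self.children = {}
        self.workers = set()  # workers known to cache the block at this node

    def insert(self, block_hashes, worker):
        node = self
        for h in block_hashes:
            node = node.children.setdefault(h, PrefixIndex())
            node.workers.add(worker)

    def overlap(self, block_hashes):
        """Return {worker: number of leading request blocks it has cached}."""
        scores = defaultdict(int)
        node = self
        for h in block_hashes:
            node = node.children.get(h)
            if node is None:
                break
            for w in node.workers:
                scores[w] += 1
        return dict(scores)


class ActiveBlocks:
    """Ephemeral per-worker count of blocks used by in-flight requests."""

    def __init__(self):
        self.by_worker = defaultdict(int)  # a new replica starts at zero

    def start(self, worker, num_blocks):
        self.by_worker[worker] += num_blocks

    def finish(self, worker, num_blocks):
        self.by_worker[worker] -= num_blocks
```

A fresh `ActiveBlocks` instance knows nothing about in-flight work, which mirrors a new replica starting with zero active block knowledge.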
## Enabling Router Replica Synchronization
```bash
# Router replica 1
python -m dynamo.frontend --router-mode kv --http-port 8000 --router-replica-sync
# Router replica 2
python -m dynamo.frontend --router-mode kv --http-port 8001 --router-replica-sync
```
The `--router-replica-sync` flag enables active block synchronization between replicas:
- Active blocks are shared via NATS core messaging.
- Replicas exchange routing decisions to maintain consistent load estimates.
- A new replica starts with zero active blocks but quickly converges through request handling and active syncing with other replicas.
Without this flag, each replica maintains its own isolated view of active blocks, which can lead to suboptimal routing.
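The effect of synchronization can be illustrated with an in-memory simulation. This is a sketch only: `Bus` stands in for the messaging layer (the page names NATS core messaging), and `RouterReplica`, `route`, and `on_message` are hypothetical names, not the Dynamo API.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for a pub/sub channel between replicas."""

    def __init__(self):
        self.subscribers = []

    def publish(self, sender, message):
        for sub in self.subscribers:
            if sub is not sender:  # do not echo back to the publisher
                sub.on_message(message)


class RouterReplica:
    def __init__(self, name, bus=None):
        self.name = name
        self.bus = bus  # None models running without --router-replica-sync
        self.active_blocks = defaultdict(int)
        if bus is not None:
            bus.subscribers.append(self)

    def route(self, worker, num_blocks):
        """Record a local routing decision and share it if sync is enabled."""
        self.active_blocks[worker] += num_blocks
        if self.bus is not None:
            self.bus.publish(self, (worker, num_blocks))

    def on_message(self, message):
        worker, num_blocks = message
        self.active_blocks[worker] += num_blocks


if __name__ == "__main__":
    bus = Bus()
    a, b = RouterReplica("a", bus), RouterReplica("b", bus)
    a.route("worker-1", 4)
    b.route("worker-1", 2)
    print(dict(a.active_blocks), dict(b.active_blocks))
```

With a shared bus, both replicas end up with the same load estimate for `worker-1`; without one, each replica only sees the requests it routed itself.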
## Persistence and Recovery
Persistence behavior depends on the event transport mode.
### NATS Core / Event Plane with Local Indexer Mode
- State persists on workers. Events are fire-and-forget, but workers retain their local indexer state.
- On startup, the router queries each worker's local indexer to rebuild state.
- Recovery depends on workers being available. If a worker is down, its blocks cannot be recovered.
- This mode keeps the infrastructure simpler because JetStream is not required.
For more on gap detection and replay, see [KV Event Replay — Dynamo vs vLLM](kv-event-replay-comparison.md).
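The recovery behavior in this mode can be sketched as a loop over workers that skips the ones that do not answer. The function `rebuild_state` and the callable-per-worker interface are assumptions made for illustration; the real router queries each worker's local indexer over the request plane.

```python
def rebuild_state(workers):
    """Rebuild prefix state by querying every worker.

    `workers` maps a worker name to a zero-argument callable that returns the
    block hashes the worker holds, or raises ConnectionError if it is down.
    """
    state = {}
    unreachable = []
    for name, query in workers.items():
        try:
            state[name] = set(query())
        except ConnectionError:
            # Blocks on an unavailable worker cannot be recovered.
            unreachable.append(name)
    return state, unreachable
```

The `unreachable` list makes the trade-off explicit: the router's view is only as complete as the set of workers that respond at startup.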
### JetStream Mode
JetStream mode requires `--router-durable-kv-events` on both frontend and workers.
- Prefix blocks are stored in NATS JetStream with 1-hour retention.
- Snapshots are saved to NATS object store at configurable thresholds.
- New replicas automatically restore this state on startup.
- You can launch a third router replica even if the first two are down, and it will recover the full prefix state.
```bash
python -m dynamo.frontend --router-mode kv --http-port 8002 --router-replica-sync
```
>[!Note]
> If you need to start with a fresh state in JetStream mode, you have two options:
> 1. Use a different namespace or component, which creates a new stream and NATS object store path.
> 2. Launch a router with `--router-reset-states`, which purges the entire stream and radix snapshot. Only do this when launching the first router replica in a component, because it can bring existing replicas into an inconsistent state.
## Additional Notes
State persistence depends on the event transport mode:
- **NATS Core / event plane mode**: State persists on workers, and the router rebuilds state by querying workers on startup.
- **JetStream mode**: State persists across router restarts via JetStream and NATS object store snapshots.
- **No KV events** (`--no-router-kv-events`): State persistence is not supported.
Request-plane transport is independent of KV event transport. The request plane (`DYN_REQUEST_PLANE` or `--request-plane`) controls how requests reach workers. KV events use NATS in JetStream or NATS Core modes, or ZMQ when `--event-plane zmq` is set. With `--event-plane zmq` and `--discovery-backend file` or `mem`, the router can run without etcd or NATS. When using a NATS-based event plane, NATS is initialized automatically; set `NATS_SERVER=nats://...` to override the default `localhost:4222`.
When `--router-kv-overlap-score-weight` is set to 0, no KV indexer is created and prefix matching is disabled. When `--no-router-kv-events` is set, a KV indexer is still created but no event subscriber is launched; the router predicts cache state from its own routing decisions with TTL-based expiration and pruning.
Backend KV event publishing is independent of the frontend's `--no-router-kv-events` flag. The frontend flag controls whether the router consumes events; backend flags control whether workers publish them. If the router is not consuming events, workers that still publish spend resources on events nobody reads, but routing behavior is not affected.
- **vLLM**: Pass `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or `'{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557"}'` to enable.
- **SGLang**: Pass `--kv-events-config` with a JSON config to enable, or omit it to keep publishing disabled.
- **TRT-LLM**: Pass `--publish-events-and-metrics` to enable, or omit it to keep publishing disabled.
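When the vLLM flag is generated from a script, building the JSON programmatically avoids shell quoting mistakes. The helper below is a hypothetical convenience, not part of Dynamo or vLLM; the JSON keys and the `tcp://*:5557` endpoint are taken from the vLLM bullet above.

```python
import json
import shlex


def vllm_kv_events_arg(enable: bool, endpoint: str = "tcp://*:5557") -> str:
    """Return the --kv-events-config argument as a shell-safe string."""
    if enable:
        config = {
            "enable_kv_cache_events": True,
            "publisher": "zmq",
            "endpoint": endpoint,
        }
    else:
        config = {"enable_kv_cache_events": False}
    # shlex.quote wraps the JSON in single quotes for POSIX shells.
    return "--kv-events-config " + shlex.quote(json.dumps(config))


if __name__ == "__main__":
    print(vllm_kv_events_arg(False))
    print(vllm_kv_events_arg(True))
```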
The CLI args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When workers are configured to publish KV events, the router relies on worker-side eviction events and these parameters are ignored.
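The interaction of the three parameters can be sketched with a small cache model. The semantics here are assumptions for illustration: entries expire once they are older than `ttl_secs`, and when the entry count exceeds `max_tree_size`, the oldest entries are dropped until the count falls to `max_tree_size * prune_target_ratio`. The class `ApproximateCache` is hypothetical and the router's actual pruning policy may differ.

```python
class ApproximateCache:
    """Toy model of router-side cache prediction without worker events."""

    def __init__(self, ttl_secs, max_tree_size, prune_target_ratio):
        self.ttl_secs = ttl_secs
        self.max_tree_size = max_tree_size
        self.prune_target_ratio = prune_target_ratio
        self.entries = {}  # block_hash -> last-seen time, oldest first

    def record(self, block_hash, now):
        """Note that a routed request touched this block at time `now`."""
        self.entries.pop(block_hash, None)  # re-insert to mark as most recent
        self.entries[block_hash] = now
        self.expire(now)
        if len(self.entries) > self.max_tree_size:
            target = int(self.max_tree_size * self.prune_target_ratio)
            while len(self.entries) > target:
                oldest = next(iter(self.entries))
                del self.entries[oldest]

    def expire(self, now):
        """Drop entries whose age exceeds the TTL."""
        stale = [h for h, t in self.entries.items() if now - t > self.ttl_secs]
        for h in stale:
            del self.entries[h]
```

Timestamps are passed in explicitly so the behavior is deterministic; a real implementation would read a monotonic clock.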
`--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. A busy threshold removes a worker from the candidate set entirely when that worker exceeds a utilization limit. In contrast, `--router-queue-threshold` defers the entire routing decision until at least one worker has capacity, so the request is routed with the freshest load metrics. The busy thresholds can be updated at runtime through the `/busy_threshold` HTTP endpoint, without restarting the frontend. For details, see [Request Rejection](../../fault-tolerance/request-rejection.md).
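The difference between the two mechanisms can be shown with a deliberately simplified model. Everything here is an assumption for illustration: each worker's load is collapsed into a single number, both thresholds are compared against it, and the function `decide` with its return values is hypothetical. The real router applies these thresholds to different metrics.

```python
def decide(loads, busy_threshold, queue_threshold):
    """Return ("deferred" | "rejected" | "routed", worker_or_None).

    `loads` maps worker name to a single simplified load value.
    """
    # Queue threshold: if no worker has capacity, hold the request and
    # decide later with fresher metrics.
    if all(load >= queue_threshold for load in loads.values()):
        return ("deferred", None)
    # Busy threshold: drop overloaded workers from the candidate set.
    candidates = {w: load for w, load in loads.items() if load < busy_threshold}
    if not candidates:
        return ("rejected", None)
    # Pick the least-loaded remaining worker.
    return ("routed", min(candidates, key=candidates.get))
```

In this model the queue threshold delays a decision without discarding any worker, while the busy threshold narrows the choice at decision time.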
......@@ -441,6 +441,6 @@ sequenceDiagram
## See Also
- **[Mooncake KV Indexer RFC](https://github.com/kvcache-ai/Mooncake/issues/1403)**: Community API standardization for KV cache indexers
- **[Router Guide](router-guide.md)**: Full KV router configuration and tuning
- **[Configuration and Tuning](router-configuration.md)**: Full KV router configuration and tuning
- **[Router Design](../../design-docs/router-design.md)**: Architecture and event transport modes
- **[Standalone Router](../../../components/src/dynamo/router/README.md)**: Full routing service (routes requests to workers)
......@@ -314,6 +314,6 @@ This dual-layer approach—persistent global KV cache state via JetStream and ep
## See Also
- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router
- **[Router Guide](../components/router/router-guide.md)**: Configuration, tuning, and production setup
- **[Configuration and Tuning](../components/router/router-configuration.md)**: Router flags, tuning, and production setup
- **[Router Examples](../components/router/router-examples.md)**: Python API usage and custom routing patterns
- **[KV Event Publishing for Custom Engines](../integrations/kv-events-custom-engines.md)**: Integrate custom inference engines with KV-aware routing
......@@ -308,5 +308,5 @@ This works end-to-end across the publisher pipeline, the KV consolidator (for de
- [Feature Matrix](../../reference/feature-matrix.md) - Backend compatibility overview
- [vLLM Backend](../../backends/vllm/README.md) - vLLM-specific configuration
- [Dynamo Operator](../../kubernetes/dynamo-operator.md) - Kubernetes operator overview
- [KV-Aware Routing](../../components/router/router-guide.md) - LoRA-aware request routing
- [Routing Concepts](../../components/router/router-concepts.md) - LoRA-aware request routing
- [KV Events for Custom Engines](../../integrations/kv-events-custom-engines.md) - Publishing LoRA-aware KV events