A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router-guide.md).
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see [Routing Concepts](/docs/components/router/router-concepts.md).
## Overview
...
...
@@ -29,7 +29,7 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
**Router Configuration:**
All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`, `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see the [Router Guide](/docs/components/router/router-guide.md).
All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`, `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see [Configuration and Tuning](/docs/components/router/router-configuration.md).
## Architecture
...
...
@@ -43,7 +43,7 @@ Clients call the `generate` endpoint to stream completions, or call `best_worker
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router-guide.md#disaggregated-serving) for the default setup.
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See [Disaggregated Serving](/docs/components/router/router-disaggregated-serving.md) for the default setup.
>
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
...
...
@@ -106,7 +106,9 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
## See Also
- [Router Guide](/docs/components/router/router-guide.md) - Configuration and tuning for KV-aware routing
- [Router Guide](/docs/components/router/router-guide.md) - Deployment modes and quick start
- [Configuration and Tuning](/docs/components/router/router-configuration.md) - CLI flags, transport modes, and metrics
- [Disaggregated Serving](/docs/components/router/router-disaggregated-serving.md) - Prefill and decode routing setups
- [Router Design](/docs/design-docs/router-design.md) - Architecture details and event transport modes
- [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
- [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
@@ -32,7 +32,7 @@ docker compose -f deploy/docker-compose.yml up -d
Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start up the ingress and `python3 -m dynamo.trtllm <args>` to start up the workers.
</Tip>
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
For detailed information about KV-aware routing behavior, see [Routing Concepts](../../components/router/router-concepts.md). For deployment modes, see the [Router Guide](../../components/router/router-guide.md).
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the [Tuning Guidelines](../components/router/router-guide.md#tuning-guidelines).
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see [Tuning Guidelines](../components/router/router-configuration.md#tuning-guidelines).
For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.
@@ -29,7 +29,7 @@ For Kubernetes, set `DYN_ROUTER_MODE=kv` on the Frontend service. Workers automa
You can also run the KV router as a standalone service (without the Dynamo frontend). See the [Standalone Router component](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/router/) for more details.
For all CLI arguments, environment variables, K8s deployment examples, and tuning guidelines, see the [Router Guide](router-guide.md). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For deployment modes and quick start steps, see the [Router Guide](router-guide.md). For CLI arguments and tuning guidelines, see [Configuration and Tuning](router-configuration.md). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
## Prerequisites and Limitations
...
...
@@ -50,7 +50,11 @@ For basic model registration without KV routing, use `--router-mode round-robin`
## Next Steps
- **[Router Guide](router-guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Guide](router-guide.md)**: Deployment modes, quick start, and page map
- **[Routing Concepts](router-concepts.md)**: Cost model and worker-selection behavior
- **[Configuration and Tuning](router-configuration.md)**: Router flags, transport modes, and metrics
- **[Disaggregated Serving](router-disaggregated-serving.md)**: Prefill and decode routing setups
- **[Router Operations](router-operations.md)**: Replicas, persistence, and recovery
- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Standalone Indexer](standalone-indexer.md)**: Run the KV indexer as a separate service for independent scaling
- **[Router Design](../../design-docs/router-design.md)**: Architecture details, algorithms, and event transport modes
@@ -106,5 +106,6 @@ For deployments using Dynamo's KV-aware routing, the local indexer is used autom
## See Also
- **[KV Router Index Data Structures](https://github.com/ai-dynamo/dynamo/blob/main/lib/kv-router/src/indexer/README.md)**: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` internals
- **[Router Guide](router-guide.md)**: Configuration, deployment, and tuning for KV-aware routing
- **[Router Guide](router-guide.md)**: Deployment modes and quick start for KV-aware routing
- **[Configuration and Tuning](router-configuration.md)**: Router flags and tuning details
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Routing Concepts
subtitle: Cost model, worker selection, and routing primitives for the Dynamo router
---
This page explains how the Dynamo router evaluates workers, chooses a target, and fits into the request path. For CLI flags and tuning knobs, see [Configuration and Tuning](router-configuration.md).
## KV Cache Routing
KV cache routing directs each request to the worker that already holds the most relevant cached KV blocks. Maximizing cache reuse avoids redundant prefill computation, which improves both throughput and latency.
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.
## Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
`overlap_score_weight` balances cache hit optimization against load distribution.
Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL).
## Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
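The selection rule above can be sketched as follows. This is a minimal illustration, not the router's implementation: the helper names `selection_probabilities` and `select_worker` are hypothetical, and the normalization (dividing each cost by the maximum cost before negating and scaling by the temperature) is an assumption, since the exact scheme is not described here.

```python
import math
import random


def selection_probabilities(costs, temperature):
    """Softmax over negated, max-normalized costs (lower cost -> higher probability)."""
    max_cost = max(costs.values()) or 1.0  # avoid dividing by zero when all costs are 0
    # Assumed normalization: scale costs into [0, 1], negate, divide by temperature.
    logits = {w: -(c / max_cost) / temperature for w, c in costs.items()}
    peak = max(logits.values())  # subtract the peak for numerical stability
    weights = {w: math.exp(logit - peak) for w, logit in logits.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def select_worker(costs, temperature=0.0, rng=random):
    """Pick the lowest-cost worker, or sample via softmax when temperature > 0."""
    if temperature == 0.0:
        return min(costs, key=costs.get)
    probs = selection_probabilities(costs, temperature)
    workers = list(probs)
    return rng.choices(workers, weights=[probs[w] for w in workers], k=1)[0]
```

With a temperature of 0 the choice is deterministic. As the temperature grows, the probabilities flatten and higher-cost workers are picked more often.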
Example calculation with `overlap_score_weight = 1.0`:
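The sketch below works through the three workers from the diagram in the KV Cache Routing section. It assumes the cost is `overlap_score_weight * prefill_blocks + decode_blocks`. That formula is an assumption that matches the description above and the diagram's selection, not a quote of the router's code, and `worker_cost` is a hypothetical helper.

```python
def worker_cost(prefill_blocks, decode_blocks, overlap_score_weight=1.0):
    """Assumed cost model: weighted prefill work plus current decode load."""
    return overlap_score_weight * prefill_blocks + decode_blocks


# Block counts taken from the diagram (prefill = blocks not covered by the cache).
workers = {
    "Worker 1": {"prefill_blocks": 8, "decode_blocks": 10},
    "Worker 2": {"prefill_blocks": 5, "decode_blocks": 5},
    "Worker 3": {"prefill_blocks": 2, "decode_blocks": 9},
}

costs = {name: worker_cost(**blocks) for name, blocks in workers.items()}
selected = min(costs, key=costs.get)

print(costs)     # {'Worker 1': 18.0, 'Worker 2': 10.0, 'Worker 3': 11.0}
print(selected)  # Worker 2
```

Worker 3 has the best cache overlap, but its decode load pushes its cost above Worker 2's, so Worker 2 is selected under this model.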
To enable KV cache-aware routing, start the frontend node like this:
```bash
python -m dynamo.frontend --router-mode kv
```
When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
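To make the event flow concrete, here is a toy sketch of an index that applies block events and reports prefix overlap per worker. It uses flat per-worker sets rather than the radix tree the router actually maintains, and the event kinds (`"stored"`, `"removed"`) and the class name `OverlapIndex` are hypothetical. It also ignores load, which the real cost function takes into account.

```python
from collections import defaultdict


class OverlapIndex:
    """Toy per-worker block index; a simplified stand-in for a radix tree."""

    def __init__(self):
        self.blocks = defaultdict(set)  # worker id -> cached block hashes

    def apply_event(self, worker, kind, block_hashes):
        """Apply a hypothetical 'stored' or 'removed' event from a worker."""
        if kind == "stored":
            self.blocks[worker].update(block_hashes)
        elif kind == "removed":
            self.blocks[worker].difference_update(block_hashes)
        else:
            raise ValueError(f"unknown event kind: {kind}")

    def overlap(self, worker, request_hashes):
        """Count the leading request blocks that the worker already caches."""
        cached = self.blocks.get(worker, set())
        count = 0
        for block_hash in request_hashes:
            if block_hash not in cached:
                break
            count += 1
        return count

    def best_worker(self, request_hashes):
        """Return the worker with the longest cached prefix, or None if empty."""
        if not self.blocks:
            return None
        return max(self.blocks, key=lambda w: self.overlap(w, request_hashes))
```

Only the leading run of matching blocks counts, because a cached block is reusable only when every block before it in the sequence is cached as well.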
To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
For detailed CLI arguments and advanced configuration options, see [Configuration and Tuning](router-configuration.md).
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, create a client tied to a component endpoint. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.
You can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
- **Least-loaded routing**: Routes to the worker with fewest active connections via `--router-mode least-loaded`
- **Device-aware weighted routing**: Routes using CPU/non-CPU ratio budgeting plus least-loaded selection within the selected device group via `--router-mode device-aware-weighted`
  In disaggregated prefill paths it skips bootstrap optimization and uses the synchronous prefill path, matching power-of-two routing.
KV cache routing uses direct routing with a special worker selection algorithm.
For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For custom routing logic and advanced patterns, see [Routing Patterns](router-examples.md#routing-patterns).
## Device-Aware Weighted Routing
`device-aware-weighted` is designed for heterogeneous fleets where CPU and non-CPU workers share the same endpoint. Instead of comparing raw in-flight counts, the router compares a capability-normalized load across the CPU and non-CPU groups, then selects the least-loaded worker within the winning group.
```text
normalized_load = total_inflight(group) / (instance_count(group) x throughput_weight)
```
The throughput weight is `1` for CPU workers and `DYN_ENCODER_CUDA_TO_CPU_RATIO` for non-CPU workers. This lets the router route proportionally to device capability instead of permanently starving slower devices.
When only one device class is present, the behavior degenerates to standard least-loaded routing.
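The group comparison can be sketched as below. The `Worker` dataclass and the function names are hypothetical, the ratio default of 8 mirrors the documented `DYN_ENCODER_CUDA_TO_CPU_RATIO` default, and tie-breaking between equally loaded groups or workers is not specified here, so this sketch simply keeps the first minimum.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    is_cpu: bool
    inflight: int  # requests currently in flight on this worker


def normalized_load(group, throughput_weight):
    """total_inflight(group) / (instance_count(group) x throughput_weight)."""
    return sum(w.inflight for w in group) / (len(group) * throughput_weight)


def pick_worker(workers, cuda_to_cpu_ratio=8.0):
    """Choose the device group with the lowest normalized load, then its least-loaded worker."""
    cpu = [w for w in workers if w.is_cpu]
    non_cpu = [w for w in workers if not w.is_cpu]
    # Skip empty groups so a single device class degenerates to least-loaded routing.
    groups = [(g, wt) for g, wt in ((cpu, 1.0), (non_cpu, cuda_to_cpu_ratio)) if g]
    if not groups:
        raise ValueError("no workers available")
    winning_group, _ = min(groups, key=lambda gw: normalized_load(*gw))
    return min(winning_group, key=lambda w: w.inflight)
```

For example, with one CPU worker at 2 in-flight requests and two non-CPU workers at 8 and 6, raw least-loaded routing would pick the CPU worker, while the normalized comparison (2.0 versus 14 / 16 = 0.875) sends the request to the non-CPU worker with 6 in flight.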
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Configuration and Tuning
subtitle: Router flags, event transport, load tracking, and tuning guidance
---
This page collects the main router flags for frontend-embedded and standalone deployments. For the routing cost model and worker-selection behavior, see [Routing Concepts](router-concepts.md).
## Routing Behavior
- `--router-kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--router-track-prefill-tokens`: Enables prompt-side load accounting in the worker cost model. This should stay enabled if you want queue thresholds, `active_prefill_tokens`, and AIC prefill load decay to reflect prompt work.
- `--router-prefill-load-model`: Selects the router's prompt-side load model. `none` keeps the existing static prompt load accounting. `aic` predicts one expected prefill duration per admitted request and lazily decays only the oldest active prefill request on each worker.
- `--router-queue-threshold`: Queue threshold fraction for prefill token capacity (default: 4.0). The router holds incoming requests in a priority queue while all workers exceed this fraction of `max_num_batched_tokens`, releasing them when capacity frees up. This defers dispatch rather than rejecting work, so routing decisions use the freshest load metrics at the moment a request is actually sent to a worker. It also enables priority scheduling via `priority` hints in `nvext.agent_hints`. Must be greater than 0. Set to `None` to disable queueing.
- `--router-queue-policy`: Scheduling policy for the router queue (default: `fcfs`).
  - `fcfs` orders by adjusted arrival time (`priority_jump - arrival_offset`) and optimizes tail TTFT.
  - `lcfs` orders by adjusted reverse arrival time (`priority_jump + arrival_offset`) and mainly serves controlled comparison experiments.
  - `wspt` orders by `(1 + priority_jump) / isl_tokens` and optimizes average TTFT.
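The three queue policies can be compared with a small sketch that computes each ordering key and dispatches the largest key first. The "largest key first" convention is an assumption (it is the reading under which `fcfs` serves earlier arrivals first and `wspt` serves shorter prompts first), and `QueuedRequest`, `queue_key`, and `dispatch_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    request_id: str
    arrival_offset: float  # time since a shared reference point
    priority_jump: float   # boost derived from the request's priority hint
    isl_tokens: int        # input sequence length; assumed to be positive


def queue_key(req, policy):
    """Ordering key for a queued request under the given policy."""
    if policy == "fcfs":
        return req.priority_jump - req.arrival_offset
    if policy == "lcfs":
        return req.priority_jump + req.arrival_offset
    if policy == "wspt":
        return (1 + req.priority_jump) / req.isl_tokens
    raise ValueError(f"unknown queue policy: {policy}")


def dispatch_order(requests, policy):
    """Request ids in dispatch order, assuming the largest key goes first."""
    ranked = sorted(requests, key=lambda r: queue_key(r, policy), reverse=True)
    return [r.request_id for r in ranked]
```

Under this reading, a large `priority_jump` lets a late request overtake earlier ones in `fcfs`, while `wspt` favors short prompts unless a longer request carries enough priority to offset its length.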
For `--router-mode device-aware-weighted`, set `DYN_ENCODER_CUDA_TO_CPU_RATIO` to the approximate throughput ratio of one non-CPU worker relative to one CPU worker. The default is `8`.
## KV Event Transport and Persistence
- `--no-router-kv-events`: Disables KV event tracking. By default, the router uses KV events to monitor block creation and deletion from workers. When disabled, the router predicts cache state from routing decisions with TTL-based expiration and pruning.
- `--router-durable-kv-events`: **Deprecated.** Enables JetStream mode for KV event transport. The event-plane subscriber in local indexer mode is now the recommended path.
- `--router-reset-states`: Only applies in JetStream mode (`--router-durable-kv-events`). Resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting from a fresh state.
- `--router-snapshot-threshold`: Only applies in JetStream mode (`--router-durable-kv-events`). Sets the number of messages in JetStream before triggering a snapshot.
## Block Tracking
- `--no-router-track-active-blocks`: Disables tracking of active blocks used for ongoing generation or decode phases. Disable this when routing to workers that only perform prefill.
- `--router-track-output-blocks`: **Experimental.** Enables tracking of output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward the expected output sequence length (`agent_hints.osl` in `nvext`).
- `--no-router-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. This is useful in disaggregated setups where transferred blocks are not actually deduplicated on the decode side.
- `--no-router-track-prefill-tokens`: Disables prompt-side prefill token accounting in the router's active load model. Use this for decode-only routing paths where prompt processing already happened elsewhere.
- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas.
## KV Indexer / Approx KV Indexer
- `--router-ttl-secs`: Time-to-live in seconds for blocks in the router's local cache predictions. Defaults to 120.0 seconds when `--no-router-kv-events` is used.
- `--router-max-tree-size`: Maximum tree size before pruning is triggered. Defaults to 1048576 (2^20 blocks) when `--no-router-kv-events` is used.
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. Defaults to 0.8 when `--no-router-kv-events` is used.
- `--router-event-threads`: Number of event processing threads for the KV indexer (default: 4). With KV events enabled, values greater than 1 use the concurrent radix tree; approximate mode always uses a single-threaded indexer.
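How the TTL, maximum size, and prune target interact in approximate mode can be illustrated with a small sketch. It uses a flat dictionary rather than a tree, evicts the entries that were written longest ago first (the real eviction order is not specified here), and the class name `ApproxBlockCache` is hypothetical. The defaults mirror the flag defaults listed above.

```python
import time


class ApproxBlockCache:
    """Toy model of predicted cache state with TTL expiry and size-based pruning."""

    def __init__(self, ttl_secs=120.0, max_tree_size=1_048_576,
                 prune_target_ratio=0.8, clock=time.monotonic):
        self.ttl_secs = ttl_secs
        self.max_tree_size = max_tree_size
        self.prune_target_ratio = prune_target_ratio
        self.clock = clock
        self.expiry = {}  # block hash -> expiry time, kept in write order

    def record(self, block_hashes):
        """Record blocks predicted to be cached after a routing decision."""
        now = self.clock()
        for block_hash in block_hashes:
            self.expiry.pop(block_hash, None)  # re-insert so refreshed blocks move to the end
            self.expiry[block_hash] = now + self.ttl_secs
        self._expire(now)
        if len(self.expiry) > self.max_tree_size:
            self._prune()

    def _expire(self, now):
        for block_hash in [h for h, t in self.expiry.items() if t <= now]:
            del self.expiry[block_hash]

    def _prune(self):
        """Drop the oldest entries until the size reaches the prune target."""
        target = int(self.max_tree_size * self.prune_target_ratio)
        while len(self.expiry) > target:
            del self.expiry[next(iter(self.expiry))]

    def contains(self, block_hash):
        expires_at = self.expiry.get(block_hash)
        return expires_at is not None and expires_at > self.clock()
```

Pruning down to a fraction of the maximum, rather than to the maximum itself, leaves headroom so that the next few insertions do not trigger another prune immediately.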
To implement KV event publishing for custom inference engines, see [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md).
For details on per-request agent hints (`priority`, `osl`, `speculative_prefill`), see [NVIDIA Request Extensions (`nvext`)](../frontend/nvext.md#agent-hints).
## Tuning Guidelines
`--router-kv-overlap-score-weight` is the primary knob for balancing prefill efficiency against decode load. Prefill-heavy workloads benefit from a higher weight, which steers requests toward workers with better cache overlap and reduces TTFT. Decode-heavy workloads benefit from a lower weight, which distributes decode load more evenly and reduces ITL. The default of 1.0 is a reasonable starting point. This weight can also be overridden per request via `nvext.agent_hints.kv_overlap_score_weight`.
Use `--no-router-kv-events` when you are not confident that your backend engine emits KV events correctly. In this mode the router falls back to approximate routing, predicting cache state from its own routing decisions with TTL-based expiration and pruning.
Use `--no-router-assume-kv-reuse` in disaggregated setups where the decode worker does not reuse transferred KV cache blocks. Without this flag, the router undercounts decode blocks when duplicates exist, leading to inaccurate load estimates.
Use `--no-router-track-prefill-tokens` when a router is serving decode-only traffic and prompt processing has already completed elsewhere. This keeps decode routing decisions focused on decode-side load instead of briefly charging prompt tokens to the decode worker after handoff.
Use `--router-track-output-blocks` when your workload is output-heavy and you want the router to account for output-side KV cache growth in load balancing. If you also pass `nvext.agent_hints.osl` per request, the router applies fractional decay to output blocks so that requests nearing completion contribute less future load.
`--router-queue-threshold` controls when incoming requests are held in a priority queue. The router waits while all workers exceed the configured fraction of `max_num_batched_tokens`, then releases work as capacity frees up. Set it to `None` to disable queueing entirely.
Use `--router-prefill-load-model aic` when you want prompt-side load tracking to decay the oldest active prefill request using an AIC-predicted duration instead of keeping prompt load static until first token. This requires `--router-track-prefill-tokens` and the shared `--aic-*` config.
Use `--router-queue-policy wspt` when your workload has a mix of short and long requests and you want to minimize average TTFT. Use the default `fcfs` when you want to minimize tail TTFT.
## Prometheus Metrics
The router exposes Prometheus metrics on the frontend's HTTP port (default 8000) at `/metrics`:
- **Router request metrics** (`dynamo_component_router_*`): Registered via the component's metrics hierarchy and exposed on the frontend via the `drt_metrics` bridge. In KV mode they are populated per request; in non-KV modes they are registered with zero values. The standalone router also registers these metrics, available on `DYN_SYSTEM_PORT` when set.
- **Routing overhead metrics** (`dynamo_router_overhead_*`) and **per-worker gauges** (`dynamo_frontend_worker_*`): Registered on the frontend's own Prometheus registry. These are frontend-only and not available on the standalone router.
For the full list of router metrics, see the [Metrics reference](../../observability/metrics.md#router-metrics).
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Disaggregated Serving
subtitle: Prefill and decode routing with the Dynamo router
---
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill`, the frontend automatically detects them and activates an internal prefill router.
For the high-level deployment matrix, see [Router Guide](router-guide.md). For the router flags used in this setup, see [Configuration and Tuning](router-configuration.md).
## Automatic Prefill Router Activation
The prefill router is automatically created when:
1. A decode model is registered, for example via `register_model()` with `ModelType.Chat | ModelType.Completions`.
2. A prefill worker is detected with the same model name and `ModelType.Prefill`.
Key characteristics of the prefill router:
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers do not perform decode.
- **Seamlessly integrates** into the request pipeline between preprocessing and decode routing.
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available.
Key characteristics of the decode routing stage in disaggregated mode:
- **Disables overlap scoring** (`overlap_score_weight=0`) because decode routing should not chase prefix reuse.
- **Disables KV reuse assumption** (`assume_kv_reuse=false`) unless the backend can truly deduplicate transferred blocks.
- **Disables prefill-token tracking** (`track_prefill_tokens=false`) so decode-side load reflects decode work rather than already-completed prompt work.
## Setup Example
When both workers are registered, requests are automatically routed.
```python
# Decode worker registration (in your decode worker)
...
```
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang, launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags such as `--router-block-size` and `--router-kv-events`. See the [Standalone Router README](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/router/README.md) and [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/launch/disagg_router.sh).
## Request Flow
The following diagram shows an overview of the major components in disaggregated serving:
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Router Operations
subtitle: Replica topology, remote indexers, state management, and recovery
---
This page covers day-2 operational topics for router deployments. For flags and tuning guidance, see [Configuration and Tuning](router-configuration.md).
## Serving Multiple Router Replicas
For improved fault tolerance, you can launch multiple frontend-plus-router replicas. If multiple `dynamo.frontend` processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone `python -m dynamo.router` service.
## Dynamo-Native Remote Indexer
For Dynamo-native deployments, the remote indexer is served by `dynamo.frontend` or `dynamo.router`, not by `dynamo.indexer`.
- Use `--serve-indexer` on router or frontend replicas that should expose `kv_indexer_query` from the worker component.
- Use `--use-remote-indexer` on consumer routers or frontends that should query that served endpoint instead of maintaining a local overlap indexer.
- `dynamo.indexer` remains the standalone HTTP plus ZMQ microservice for non-Dynamo or direct-ZMQ deployments.
The served service is request-plane only. Each serving router or frontend keeps its normal local KV event ingestion, gap detection, and worker-query recovery path; remote consumers only issue hash-based overlap queries.
Approximate mode (`--no-router-kv-events`) is singleton-only for remote serving: only one `--serve-indexer` replica may exist for a given worker component. Event-driven mode allows multiple serving replicas behind the same worker component.
1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is persistent. In local indexer mode, state is rebuilt from workers on startup. In JetStream mode (`--router-durable-kv-events`) it is backed by JetStream events and object store snapshots.
2. **Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is ephemeral. When a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests.
For the architecture behind these states, see [Router Design](../../design-docs/router-design.md).
> If you need to start with a fresh state in JetStream mode, you have two options:
> 1. Use a different namespace or component, which creates a new stream and NATS object store path.
> 2. Launch a router with `--router-reset-states`, which purges the entire stream and radix snapshot. Only do this when launching the first router replica in a component, because it can bring existing replicas into an inconsistent state.
## Additional Notes
State persistence depends on the event transport mode:
- **NATS Core / event plane mode**: State persists on workers, and the router rebuilds state by querying workers on startup.
- **JetStream mode**: State persists across router restarts via JetStream and NATS object store snapshots.
- **No KV events** (`--no-router-kv-events`): State persistence is not supported.
Request-plane transport is independent of KV event transport. The request plane (`DYN_REQUEST_PLANE` or `--request-plane`) controls how requests reach workers. KV events use NATS in JetStream or NATS Core modes, or ZMQ when `--event-plane zmq` is set. With `--event-plane zmq` and `--discovery-backend file` or `mem`, the router can run without etcd or NATS. When using a NATS-based event plane, NATS is initialized automatically; set `NATS_SERVER=nats://...` to override the default `localhost:4222`.
When `--router-kv-overlap-score-weight` is set to 0, no KV indexer is created and prefix matching is disabled. When `--no-router-kv-events` is set, a KV indexer is still created but no event subscriber is launched; the router predicts cache state from its own routing decisions with TTL-based expiration and pruning.
Backend KV event publishing is independent of the frontend's `--no-router-kv-events` flag. The frontend flag controls whether the router consumes events; backend flags control whether workers publish them. If the router is not consuming events, workers that still publish spend resources on events nobody reads, but routing behavior is unaffected.
- **vLLM**: Pass `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or `'{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557"}'` to enable.
- **SGLang**: Pass `--kv-events-config` with a JSON config to enable, or omit it to keep publishing disabled.
- **TRT-LLM**: Pass `--publish-events-and-metrics` to enable, or omit it to keep publishing disabled.
The CLI args `--router-ttl-secs`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When workers are configured to publish KV events, the router relies on worker-side eviction events and these parameters are ignored.
`--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. Busy thresholds reject a worker entirely from the candidate set when it exceeds a utilization limit. In contrast, `--router-queue-threshold` defers the entire routing decision until at least one worker has capacity, so the request is routed with the freshest load metrics. The busy thresholds can be updated at runtime without restarting the frontend via the `/busy_threshold` HTTP endpoint. For details, see [Request Rejection](../../fault-tolerance/request-rejection.md).
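The difference between the two mechanisms can be sketched as two independent checks. Both helper names are hypothetical. Treating utilization as a 0 to 1 fraction is an assumption, as is reading "exceed" as a strict comparison.

```python
def candidate_workers(utilization, busy_threshold):
    """Busy threshold: drop workers above the utilization limit from the candidate set."""
    return [w for w, used in utilization.items() if used <= busy_threshold]


def should_queue(active_prefill_tokens, max_num_batched_tokens, queue_threshold):
    """Queue threshold: hold the request while every worker is over the limit."""
    if queue_threshold is None:  # queueing disabled
        return False
    limit = queue_threshold * max_num_batched_tokens
    return all(tokens > limit for tokens in active_prefill_tokens.values())
```

The first check filters individual workers for a request that is being routed now. The second decides whether to route at all yet. A single worker under the limit is enough to release the request, at which point the routing decision is made against current load.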