docs: Migrate router documentation to three-tier structure (#5979)

Signed-off-by: akshatha-k <akshutk@gmail.com> Signed-off-by: dagil-nvidia <dagil@nvidia.com> Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: dagil-nvidia <dagil@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 80e7bafd authored Feb 05, 2026 by akshatha-k Committed by GitHub Feb 06, 2026
20 changed files
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
 |---|:----:|:----------:|:--:|
 | **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
 | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
-| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
+| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ |
 | [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
 | [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
 | [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
@@ -388,7 +388,7 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
 <!-- Reference links for Feature Compatibility Matrix -->
 [disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/router/kv_cache_routing.md
+[kv-routing]: docs/router/README.md
 [planner]: docs/planner/sla_planner.md
 [kvbm]: docs/kvbm/README.md
 [mm]: examples/multimodal/

--- a/benchmarks/router/README.md
+++ b/benchmarks/router/README.md
@@ -127,7 +127,7 @@ To see all available router arguments, run:
 python -m dynamo.frontend --help
 ```
-For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).
+For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md).
 > [!Note]
 > If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
@@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
 - Uses the same routing mode as the frontend's `--router-mode` setting
 - Seamlessly integrates with your decode workers for token generation
-No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
+No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details.
 > [!Note]
 > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)

--- a/components/src/dynamo/router/README.md
+++ b/components/src/dynamo/router/README.md
@@ -3,7 +3,7 @@
 # Standalone Router
-A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md).
+A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md).
 ## Overview
@@ -29,7 +29,7 @@ python -m dynamo.router \
 - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
 **Router Configuration:**
-For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md).
+For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md).
 ## Architecture
@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
 ## Example: Manual Disaggregated Serving (Alternative Setup)
 > [!Note]
-> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [KV Cache Routing documentation](../../../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for the default setup.
+> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup.
 >
 > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
@@ -103,6 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
 ## See Also
- [KV Cache Routing Architecture](/docs/router/kv_cache_routing.md) - Detailed explanation of KV-aware routing
+- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing
+- [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
 - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
 - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -216,11 +216,11 @@ Common Vars for Routing Configuration:
 - Set `DYN_ENFORCE_DISAGG=true` to enforce that every request is served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available, the request is served in the aggregated manner.
 - By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
 - If using kv-routing:
-  - Overwrite the `DYN_KV_BLOCK_SIZE` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
+  - Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
-  - Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
+  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
-  - Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
+  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
-  - Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
+  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
-  - See the [KV cache routing design](../../docs/router/kv_cache_routing.md) for details.
+  - See the [Router Guide](../../docs/router/router_guide.md) for details.
 Stand-Alone installation only:

--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -36,7 +36,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------|-------|
 | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
 | [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ |  |
 | [**Multimodal Support**](../../multimodal/sglang.md) | ✅ |  |
 | [**KVBM**](../../kvbm/README.md) | ❌ | Planned |

--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -55,7 +55,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
 | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
 | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
 | [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
@@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs
 > [!IMPORTANT]
 > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
-For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing.md).
+For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md).
 ### Aggregated
 ```bash

--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -37,7 +37,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
-| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../router/README.md) | ✅ |  |
 | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ |  |
 | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
 | [**KVBM**](../../../docs/kvbm/README.md) | ✅ |  |
@@ -179,7 +179,7 @@ When using KV-aware routing, ensure deterministic hashing across processes to av
 ```bash
 vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
 ```
-See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs.
+See the high-level notes in [Router Design](../../design_docs/router_design.md#deterministic-event-ids) on deterministic event IDs.
 ## Request Migration

--- a/docs/design_docs/architecture.md
+++ b/docs/design_docs/architecture.md
@@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
 The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
 - [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](../router/kv_cache_routing.md)
+- [Dynamo Smart Router](../router/README.md)
 - [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
 - [Planner](../planner/planner_intro.rst)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)

--- a/docs/design_docs/router_design.md
+++ b/docs/design_docs/router_design.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Router Design
+This document describes the internal architecture of the Dynamo KV Router, including block tracking mechanisms, the KV cache optimization system, event handling, and transport modes.
+## KV Router Architecture
+The KV Router tracks two key metrics for each worker:
+1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
+2. **Potential New Prefill Blocks**: The number of blocks that need to be computed from scratch on a worker, derived from the input tokens not covered by cached blocks:
+   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
+   - Potential prefill blocks = New prefill tokens / Block size
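The two formulas above can be checked with a short worked example. This is a minimal sketch, not the router's implementation: the function name and the sample numbers (160 input tokens, block size 16, 5 overlapping blocks) are assumptions chosen for illustration.

```python
def potential_prefill_blocks(total_input_tokens: int,
                             overlap_blocks: int,
                             block_size: int) -> float:
    """Hypothetical helper mirroring the two formulas in the list above."""
    # New prefill tokens = Total input tokens - (Overlap blocks x Block size)
    new_prefill_tokens = total_input_tokens - overlap_blocks * block_size
    # Potential prefill blocks = New prefill tokens / Block size
    return new_prefill_tokens / block_size


# 160 tokens with 5 cached blocks of 16 tokens leaves 80 tokens to prefill.
blocks = potential_prefill_blocks(total_input_tokens=160,
                                  overlap_blocks=5,
                                  block_size=16)
print(blocks)  # 5.0
```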
+### Block Tracking Mechanisms
+The router maintains block information through two complementary systems:
+- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
+  - Incremented when adding a new request
+  - Updated during token generation
+  - Decremented upon request completion
+- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
+## KV Cache Router
+The leading Large Language Models (LLMs) today are auto-regressive and based on the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization is to cache the already computed keys and values and reuse them for future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).
+### KV Cache Routing and Load Balancing
+```mermaid
+graph TD
+    T[Tokens] --> R[KV Aware Router]
+    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
+    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
+    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
+    style T fill:#fff3e0,stroke:#333,color:#333
+    style R fill:#2e8b57,stroke:#333,color:#fff
+    style W1 fill:#f3e5f5,stroke:#333,color:#333
+    style W2 fill:#c8e6c9,stroke:#333,color:#333
+    style W3 fill:#f3e5f5,stroke:#333,color:#333
+    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
+```
+The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.
+#### Cost Calculation
+1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
+2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
+3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
+   - Lower costs indicate better routing choices
+   - `overlap_score_weight` balances cache hit optimization against load distribution
+   - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
+#### Worker Selection
+The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
+Example calculation with `overlap_score_weight = 1.0`:
+- Worker 1: cost = 1.0 * 8 + 10 = 18
+- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
+- Worker 3: cost = 1.0 * 2 + 9 = 11
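The cost formula and the example above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: `WorkerLoad`, `cost`, and `select_worker` are hypothetical names, and the min-max normalization applied before the softmax is one plausible reading of "normalized cost logits", not necessarily what the router does.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    prefill_blocks: float  # blocks that would need to be computed from scratch
    decode_blocks: float   # potential active decode blocks on the worker


def cost(w: WorkerLoad, overlap_score_weight: float = 1.0) -> float:
    # cost = overlap_score_weight * prefill_blocks + decode_blocks
    return overlap_score_weight * w.prefill_blocks + w.decode_blocks


def select_worker(workers, overlap_score_weight=1.0, temperature=0.0, rng=random):
    costs = {wid: cost(w, overlap_score_weight) for wid, w in workers.items()}
    if temperature == 0.0:
        # Deterministic choice: the lowest-cost worker wins.
        return min(costs, key=costs.get)
    # Assumed normalization: rescale costs to [0, 1], negate so that a lower
    # cost gives a higher logit, then apply a temperature-scaled softmax.
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0
    logits = [-(c - lo) / span / temperature for c in costs.values()]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(list(costs), weights=weights, k=1)[0]


workers = {
    "worker-1": WorkerLoad(prefill_blocks=8, decode_blocks=10),  # cost 18
    "worker-2": WorkerLoad(prefill_blocks=5, decode_blocks=5),   # cost 10
    "worker-3": WorkerLoad(prefill_blocks=2, decode_blocks=9),   # cost 11
}
print(select_worker(workers))  # worker-2
```

With a non-zero temperature, `worker-3` (cost 11) is chosen almost as often as `worker-2`, while `worker-1` is chosen rarely. This is the load-spreading effect the text describes.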
+### KV Cache Optimizations
+Every inference framework maintains a KV Cache on each worker. A popular inference framework is [vLLM](https://github.com/vllm-project/vllm), whose key contribution was [PagedAttention](https://arxiv.org/abs/2309.06180), which manages the KV Cache efficiently by chunking requests into blocks.
+Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104), which introduced a prefix tree that allows efficient matching, insertion, and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
+In Dynamo, we introduce a KVPublisher, which emits the KV Cache events that occur at each worker, and a KVIndexer, which keeps track of these events globally.
+### KV Block Management Flow
+To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
+1. **Request tokenization**: The incoming prompt is converted into tokens
+2. **Block partitioning**: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
+3. **Block hashing**: Each block of tokens is hashed to create a unique identifier
+4. **Cache lookup**:
+    - For each block, the system checks if a matching block already exists in the KV cache
+    - If a match is found, the existing KV cache block is reused
+    - If no match is found, the system proceeds to the next step
+5. **Resource allocation**:
+    - For blocks without matches, the system attempts to allocate new memory space
+    - If sufficient memory is available, allocate memory space and proceed to step 7
+    - If memory is constrained, proceed to step 6
+6. **Cache eviction** (when necessary):
+    - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
+    - Selected blocks are evicted from the cache
+    - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
+    - Alternatively, some systems may offload less-frequently used blocks to CPU memory.
+7. **KV computation**:
+    - For new blocks, the model computes key and value tensors
+    - These tensors are stored in the newly allocated cache blocks
+    - **KVPublisher emits a KV stored event notifying KVIndexer about newly stored blocks.**
+Further details are available for [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching), and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
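The block management flow can be illustrated with a toy cache. This is a sketch under several assumptions, not how any particular engine works: `block_hashes` and `ToyBlockCache` are hypothetical names, only full blocks are hashed, each hash is chained to its parent block's hash, eviction is plain LRU, and events are simple tuples handed to a `publish` callback standing in for the KVPublisher.

```python
import hashlib
from collections import OrderedDict


def block_hashes(tokens, block_size):
    """Split tokens into full blocks and hash each one, chained to its parent."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


class ToyBlockCache:
    def __init__(self, capacity, publish):
        self.capacity = capacity     # maximum number of cached blocks
        self.blocks = OrderedDict()  # block hash -> placeholder, in LRU order
        self.publish = publish       # stand-in for the KVPublisher

    def admit(self, tokens, block_size):
        """Run cache lookup, allocation, and eviction; return reused block count."""
        reused = 0
        for h in block_hashes(tokens, block_size):
            if h in self.blocks:              # cache lookup hit: reuse the block
                self.blocks.move_to_end(h)
                reused += 1
                continue
            if len(self.blocks) >= self.capacity:
                evicted, _ = self.blocks.popitem(last=False)  # evict LRU block
                self.publish(("removed", evicted))
            self.blocks[h] = True             # allocate and "compute" the block
            self.publish(("stored", h))
        return reused


events = []
cache = ToyBlockCache(capacity=2, publish=events.append)
first = cache.admit(list(range(8)), block_size=4)   # two new blocks
second = cache.admit(list(range(8)), block_size=4)  # both blocks reused
```

Chaining each hash to its parent means two blocks with identical tokens but different prefixes get different identifiers, which is what makes prefix matching on the indexer side possible.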
+## Events
+### KVPublisher
+The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
+The two types of events are:
+- KV stored event
+- KV removed event
+The publisher can be initialized and used through C bindings or Python bindings.
+### Deterministic Event IDs
+Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's built-in `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect.
+### KVIndexer
+The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
+The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
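The worker-annotated prefix tree can be sketched as follows. This is an illustrative toy, not the KvIndexer API: `ToyKvIndexer`, `apply_stored`, and `find_matches` are hypothetical names, and the lookup takes a sequence of block hashes rather than raw tokens to keep the example short.

```python
class Node:
    def __init__(self):
        self.children = {}    # block hash -> child Node
        self.workers = set()  # worker ids that hold this block


class ToyKvIndexer:
    def __init__(self):
        self.root = Node()

    def apply_stored(self, worker_id, block_hashes):
        """Record that a worker stored this sequence of blocks, starting at the root."""
        node = self.root
        for h in block_hashes:
            node = node.children.setdefault(h, Node())
            node.workers.add(worker_id)

    def find_matches(self, block_hashes):
        """Return {worker_id: number of leading blocks already cached on that worker}."""
        matches = {}
        node = self.root
        for h in block_hashes:
            node = node.children.get(h)
            if node is None:
                break  # no worker holds a longer prefix
            for worker_id in node.workers:
                matches[worker_id] = matches.get(worker_id, 0) + 1
        return matches


indexer = ToyKvIndexer()
indexer.apply_stored(1, ["a", "b", "c"])
indexer.apply_stored(2, ["a", "b"])
print(indexer.find_matches(["a", "b", "c", "d"]))  # {1: 3, 2: 2}
```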
+### Inter-Router Communication
+In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
+1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
+2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
+3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
+Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
+## Event Transport Modes
+The router supports two event transport modes for KV cache state synchronization:
+- **JetStream (default)**: Persistent event stream with durable consumers. State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency.
+- **NATS Core with Local Indexer** (`--enable-local-indexer` on workers): Fire-and-forget pub/sub where workers maintain local radix trees. Router rebuilds state by querying workers on startup. Lower latency, simpler setup.
+### JetStream Mode
+KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
+- **Best for**: Production deployments requiring durability and multi-replica router consistency
+- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees
+```mermaid
+graph TD
+    subgraph Engines
+        E1[Engine 1<br/>KVPublisher]
+        E2[Engine 2<br/>KVPublisher]
+        E3[Engine 3<br/>KVPublisher]
+    end
+    subgraph "NATS JetStream"
+        JS[(Persistent KV Events Stream<br/>- Block created<br/>- Block removed)]
+    end
+    subgraph "NATS Object Store"
+        OS[(Radix Tree<br/>State Snapshot)]
+    end
+    subgraph "Router Replicas"
+        R1[Router 1<br/>KVIndexer]
+        R2[Router 2<br/>KVIndexer]
+    end
+    E1 -->|Publish Events| JS
+    E2 -->|Publish Events| JS
+    E3 -->|Publish Events| JS
+    JS -->|Consume as Durable Consumer| R1
+    JS -->|Consume as Durable Consumer| R2
+    JS -->|Periodic Snapshot| OS
+    style JS fill:#e1f5fe,stroke:#333,color:#333
+    style OS fill:#e1f5fe,stroke:#333,color:#333
+    style E1 fill:#f3e5f5,stroke:#333,color:#333
+    style E2 fill:#f3e5f5,stroke:#333,color:#333
+    style E3 fill:#f3e5f5,stroke:#333,color:#333
+    style R1 fill:#2e8b57,stroke:#333,color:#fff
+    style R2 fill:#2e8b57,stroke:#333,color:#fff
+    linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
+```
+### NATS Core with Local Indexer
+When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly.
+- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios
+- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
+- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker)
+```mermaid
+graph TD
+    subgraph Engines
+        E1[Engine 1<br/>LocalKvIndexer]
+        E2[Engine 2<br/>LocalKvIndexer]
+        E3[Engine 3<br/>LocalKvIndexer]
+    end
+    subgraph "NATS Core"
+        NC[KV Events Pub/Sub<br/>- Block created<br/>- Block removed]
+    end
+    subgraph "Router Replicas"
+        R1[Router 1<br/>KVIndexer]
+        R2[Router 2<br/>KVIndexer]
+    end
+    E1 -->|Publish Events| NC
+    E2 -->|Publish Events| NC
+    E3 -->|Publish Events| NC
+    NC -->|Subscribe| R1
+    NC -->|Subscribe| R2
+    style NC fill:#e1f5fe,stroke:#333,color:#333
+    style E1 fill:#f3e5f5,stroke:#333,color:#333
+    style E2 fill:#f3e5f5,stroke:#333,color:#333
+    style E3 fill:#f3e5f5,stroke:#333,color:#333
+    style R1 fill:#2e8b57,stroke:#333,color:#fff
+    style R2 fill:#2e8b57,stroke:#333,color:#fff
+    linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
+```
+**How gap detection works:**
+1. Each worker assigns monotonically increasing event IDs starting from 0
+2. The router tracks the last received event ID per worker
+3. If an event arrives with `event_id > last_id + 1`, the router detects a gap
+4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]`
+5. On worker discovery (Added event), the router dumps the worker's entire local indexer state
+**Startup behavior:**
+- When a worker is discovered, the router queries and ingests its full local indexer state
+- When a worker is removed, the router removes all its blocks from the global radix tree
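The gap-detection steps above can be sketched with a small tracker. This is a minimal illustration with assumed names (`GapTracker`, `on_event`); it starts each worker at `-1` so that the first expected event ID is 0, and it only reports the missing range rather than querying a worker's local indexer.

```python
class GapTracker:
    def __init__(self):
        self.last_id = {}  # worker id -> last event ID seen from that worker

    def on_event(self, worker_id, event_id):
        """Return the inclusive (start, end) range of missed IDs, or None if no gap."""
        last = self.last_id.get(worker_id, -1)
        gap = None
        if event_id > last + 1:
            # Missing range [last_id + 1, event_id - 1] should be fetched
            # from the worker's local indexer.
            gap = (last + 1, event_id - 1)
        # Never move backwards on stale or duplicate events.
        self.last_id[worker_id] = max(last, event_id)
        return gap


tracker = GapTracker()
tracker.on_event("worker-a", 0)
tracker.on_event("worker-a", 1)
missing = tracker.on_event("worker-a", 4)
print(missing)  # (2, 3)
```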
+> [!Note]
+> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode.
+### Local Active Block Management with Replica Sync
+In addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
+- The router receives and routes a request
+- The first token is generated (prefill complete)
+- The response ends (request freed)
+This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS Core messaging.
+```mermaid
+sequenceDiagram
+    participant C1 as Client 1
+    participant R1 as Router 1<br/>(Slot Manager)
+    participant R2 as Router 2<br/>(Slot Manager)
+    participant C2 as Client 2
+    Note over R1,R2: Router Replica Sync Enabled
+    C1->>R1: Request A
+    activate R1
+    R1->>R1: Predict blocks & route to worker
+    R1-->>R2: Sync: AddRequest(A)
+    C2->>R2: Request B
+    activate R2
+    R2->>R2: Predict blocks & route to worker
+    R2-->>R1: Sync: AddRequest(B)
+    R1->>R1: First token received<br/>(prefill complete)
+    R1-->>R2: Sync: MarkPrefillCompleted(A)
+    R1->>C1: Stream response
+    R2->>R2: First token received<br/>(prefill complete)
+    R2-->>R1: Sync: MarkPrefillCompleted(B)
+    R2->>C2: Stream response
+    R1->>R1: Response complete<br/>(free blocks)
+    R1-->>R2: Sync: Free(A)
+    deactivate R1
+    R2->>R2: Response complete<br/>(free blocks)
+    R2-->>R1: Sync: Free(B)
+    deactivate R2
+    Note over R1,R2: Both routers have consistent<br/>view of active blocks
+```
+This dual-layer approach, which combines persistent global KV cache state via JetStream with ephemeral active block synchronization across router replicas, enables the system to make routing decisions that balance cache reuse with load distribution.
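The replica sync flow in the sequence diagram can be sketched with a toy slot manager. Everything here is an assumption made for illustration: the class and method names, the dictionary event shape, and the choice to zero out prefill blocks on `MarkPrefillCompleted`. The event type names and the router ID check for skipping self-published events come from the Inter-Router Communication section.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    worker_id: str
    prefill_blocks: int
    decode_blocks: int


class ToySlotManager:
    def __init__(self, router_id):
        self.router_id = router_id
        self.slots = {}  # request id -> Slot

    def apply(self, event):
        """Apply an event to local state; return False if it was ignored."""
        kind, request_id = event["type"], event["request_id"]
        if kind == "AddRequest":
            self.slots[request_id] = Slot(
                event["worker_id"], event["prefill_blocks"], event["decode_blocks"]
            )
        elif kind == "MarkPrefillCompleted" and request_id in self.slots:
            self.slots[request_id].prefill_blocks = 0
        elif kind == "Free":
            self.slots.pop(request_id, None)
        else:
            return False
        return True

    def on_remote(self, event):
        """Handle a synced event, skipping events this router published itself."""
        if event["router_id"] == self.router_id:
            return False
        return self.apply(event)

    def load(self, worker_id):
        """Return (prefill blocks, decode blocks) currently predicted for a worker."""
        slots = [s for s in self.slots.values() if s.worker_id == worker_id]
        return (sum(s.prefill_blocks for s in slots),
                sum(s.decode_blocks for s in slots))


r1, r2 = ToySlotManager("router-1"), ToySlotManager("router-2")
add_a = {"type": "AddRequest", "router_id": "router-1", "request_id": "A",
         "worker_id": "worker-1", "prefill_blocks": 5, "decode_blocks": 8}
r1.apply(add_a)      # Router 1 routes request A locally
r2.on_remote(add_a)  # Router 2 receives the sync message
```

After both calls, the two routers report the same load for `worker-1`, which is the consistency property the diagram's final note describes.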
+## See Also
+- **[Router README](../router/README.md)**: Quick start guide for the KV Router
+- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup
+- **[Router Examples](../router/router_examples.md)**: Python API usage and custom routing patterns
+- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
--- a/docs/features/lora/README.md
+++ b/docs/features/lora/README.md
@@ -311,4 +311,4 @@ kubectl logs deployment/my-worker | grep -i lora
 - [Feature Matrix](../../reference/feature-matrix.md) - Backend compatibility overview
 - [vLLM Backend](../../backends/vllm/README.md) - vLLM-specific configuration
 - [Dynamo Operator](../../kubernetes/dynamo_operator.md) - Kubernetes operator overview
- [KV-Aware Routing](../../router/kv_cache_routing.md) - LoRA-aware request routing
+- [KV-Aware Routing](../../router/router_guide.md) - LoRA-aware request routing
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -37,11 +37,11 @@
   kubernetes/README.md
   reference/cli.md
   observability/metrics.md
+   integrations/kv_events_custom_engines.md
   agents/tool-calling.md
   development/jail_stream.md
-   router/kv_cache_routing.md
+   router/router_examples.md
-   router/kv_events.md
   planner/load_planner.md
   fault_tolerance/README.md
   fault_tolerance/request_migration.md
@@ -75,6 +75,7 @@
   backends/vllm/deepseek-r1.md
   backends/vllm/gpt-oss.md
+   integrations/lmcache_integration.md
   backends/vllm/multi-node.md
   backends/vllm/prometheus.md
   backends/vllm/prompt-embeddings.md

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -59,6 +59,7 @@ Quickstart
   :caption: User Guides
   KV Cache Offloading <kvbm/kvbm_guide.md>
+   KV Aware Routing <router/router_guide.md>
   Tool Calling <agents/tool-calling.md>
   Multimodality Support <features/multimodal/README.md>
   LoRA Adapters <features/lora/README.md>
@@ -89,6 +90,7 @@ Quickstart
   Architecture Flow <design_docs/dynamo_flow.md>
   Disaggregated Serving <design_docs/disagg_serving.md>
   Distributed Runtime <design_docs/distributed_runtime.md>
+   Router Design <design_docs/router_design.md>
   Request Plane <design_docs/request_plane.md>
   Event Plane <design_docs/event_plane.md>
   Planner Design <design_docs/planner_design.md>
--- a/docs/router/kv_events.md
+++ b/docs/router/kv_events.md
@@ -282,3 +282,9 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo
 2. **Block size must match** your engine's actual `kv_block_size`
 3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
+## See Also
+- **[Router README](../router/README.md)**: Quick start guide for the KV Router
+- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup
+- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
--- a/docs/reference/feature-matrix.md
+++ b/docs/reference/feature-matrix.md
@@ -119,7 +119,7 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full
 <!-- Design Docs -->
 [disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/router/kv_cache_routing.md
+[kv-routing]: docs/router/README.md
 [planner]: docs/planner/planner_intro.rst
 [kvbm]: docs/kvbm/kvbm_intro.rst
 [migration]: docs/fault_tolerance/request_migration.md

--- a/docs/router/README.md
+++ b/docs/router/README.md
@@ -3,11 +3,9 @@ SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
 SPDX-License-Identifier: Apache-2.0
 -->
-# KV Router
+# Router
-## Overview
+The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
-The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
 ## Quick Start
@@ -24,14 +22,23 @@ This command:
 - Exposes the service on port 8000 (configurable)
 - Automatically handles all backend workers registered to the Dynamo endpoint
-Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
+Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
- Tracks the state of all registered workers
- Makes routing decisions based on KV cache overlap
+#### CLI Arguments
- Balances load across available workers
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing |
+| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
+| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
+| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
+| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
+For all available options: `python -m dynamo.frontend --help`
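As a rough illustration of how `--kv-overlap-score-weight` and `--router-temperature` interact, the sketch below assumes the cost formula from the earlier version of this README (`logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks`, lower is better) and softmax sampling over negated costs. The function names and the exact normalization are illustrative assumptions, not Dynamo's implementation.

```python
from __future__ import annotations

import math
import random


def routing_cost(prefill_blocks: float, active_blocks: float,
                 overlap_weight: float = 1.0) -> float:
    """Per-worker cost; lower means the worker is a better target."""
    return overlap_weight * prefill_blocks + active_blocks


def pick_worker(costs: dict[str, float], temperature: float = 0.0,
                rng: random.Random | None = None) -> str:
    """Pick a worker: deterministic at temperature 0.0, softmax sampling otherwise."""
    if temperature <= 0.0:
        # Deterministic: always take the cheapest worker.
        return min(costs, key=costs.get)
    rng = rng or random.Random()
    # Softmax over negated costs (assumed form). Subtracting the minimum
    # keeps the exponentials numerically stable.
    lowest = min(costs.values())
    workers = list(costs)
    weights = [math.exp(-(costs[w] - lowest) / temperature) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


# Hypothetical per-worker block counts, for illustration only.
costs = {
    "worker_1": routing_cost(prefill_blocks=100.5, active_blocks=25.0),
    "worker_2": routing_cost(prefill_blocks=10.0, active_blocks=60.0),
}
print(pick_worker(costs))                   # cheapest worker
print(pick_worker(costs, temperature=0.5))  # occasionally a costlier worker
```

Raising the temperature flattens the sampling weights, which spreads requests across workers at the expense of always choosing the lowest-cost one.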
 ### Kubernetes Deployment
-To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
+To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
 ```yaml
 apiVersion: nvidia.com/v1alpha1
@@ -47,11 +54,6 @@ spec:
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
-    Worker:
-      # ... worker configuration ...
 ```
 **Key Points:**
@@ -59,258 +61,43 @@ spec:
 - Workers automatically report KV cache events to the router
 - No worker-side configuration changes needed
-**Complete K8s Examples:**
+#### Environment Variables
- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
-**For A/B Testing and Advanced K8s Setup:**
-See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
-## Configuration Options
-### CLI Arguments (Python Deployment)
-The KV Router supports several key configuration options:
- **`--router-mode kv`**: Enable KV cache-aware routing (required)
- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
-  - `0.0`: Deterministic selection of the best worker
-  - `> 0.0`: Probabilistic selection using softmax sampling
-  - Higher values increase randomness, helping prevent worker saturation
- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
-  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
-  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
- **`--kv-overlap-score-weight <float>`**: Balance between prefill and decode optimization (default: 1.0)
-  - Higher values (> 1.0): Prioritize reducing prefill cost (better TTFT)
-  - Lower values (< 1.0): Prioritize decode performance (better ITL)
-For a complete list of available options:
-```bash
-python -m dynamo.frontend --help
-```
-### Kubernetes Environment Variables
-All CLI arguments can be configured via environment variables in Kubernetes deployments. Use the `DYN_` prefix with uppercase parameter names:
-| CLI Argument | K8s Environment Variable | Default | Description |
-|--------------|-------------------------|---------|-------------|
-| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` | Enable KV router |
-| `--router-temperature <float>` | `DYN_ROUTER_TEMPERATURE=<float>` | `0.0` | Routing randomness |
-| `--kv-cache-block-size <size>` | `DYN_KV_CACHE_BLOCK_SIZE=<size>` | Backend-specific | KV cache block size |
-| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | Disable KV event tracking |
-| `--kv-overlap-score-weight <float>` | `DYN_KV_OVERLAP_SCORE_WEIGHT=<float>` | `1.0` | Prefill vs decode weight |
-| `--http-port <port>` | `DYN_HTTP_PORT=<port>` | `8000` | HTTP server port |
-### Example with Advanced Configuration
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  services:
-    Frontend:
-      dynamoNamespace: my-namespace
-      componentType: frontend
-      replicas: 1
-      envs:
-        - name: DYN_ROUTER_MODE
-          value: kv
-        - name: DYN_ROUTER_TEMPERATURE
-          value: "0.5"  # Add some randomness to prevent worker saturation
-        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
-          value: "1.5"  # Prioritize TTFT over ITL
-        - name: DYN_KV_CACHE_BLOCK_SIZE
-          value: "16"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
-```
-### Alternative: Using Command Args in K8s
-You can also pass CLI arguments directly in the container command:
-```yaml
-extraPodSpec:
-  mainContainer:
-    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
-    command:
-      - /bin/sh
-      - -c
-    args:
-      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
-```
-**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
-## KV Router Architecture
-The KV Router tracks two key metrics for each worker:
-1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
-2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as:
-   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
-   - Potential prefill blocks = New prefill tokens / Block size
-### Block Tracking Mechanisms
+All CLI arguments can be configured via environment variables using the `DYN_` prefix:
-The router maintains block information through two complementary systems:
+| CLI Argument | Environment Variable | Default |
+|--------------|---------------------|---------|
+| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` |
+| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
+| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
+| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
+| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
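The naming rule in the table (the `DYN_` prefix plus the uppercased parameter name) can be sketched as a small helper. `cli_flag_to_env_var` is a hypothetical name used only for illustration and is not part of Dynamo.

```python
def cli_flag_to_env_var(flag: str) -> str:
    """Map a CLI flag such as '--router-mode' to its DYN_-prefixed variable name."""
    name = flag.lstrip("-").replace("-", "_").upper()
    return f"DYN_{name}"


print(cli_flag_to_env_var("--router-temperature"))
print(cli_flag_to_env_var("--kv-overlap-score-weight"))
```

Negated boolean flags are the exception in the table: `--no-kv-events` maps to `DYN_KV_EVENTS=false` rather than to a `DYN_NO_KV_EVENTS` variable, so this helper does not cover them.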
- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
+For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
-  - Incremented when adding a new request
-  - Updated during token generation
-  - Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
+For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md).
-## Cost Function
+For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md).
-The KV Router's routing decision is based on a simple cost function:
-```
-logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
-```
-Where:
- Lower logit values are better (less computational cost)
- The router uses softmax sampling with optional temperature to select workers
-### Key Parameter: kv-overlap-score-weight
-The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
- **Higher values (> 1.0)**: Emphasize reducing prefill cost
-  - Prioritizes routing to workers with better cache hits
-  - Optimizes for Time To First Token (TTFT)
-  - Best for workloads where initial response latency is critical
- **Lower values (< 1.0)**: Emphasize decode performance
-  - Distributes active decoding blocks more evenly
-  - Optimizes for Inter-Token Latency (ITL)
-  - Best for workloads with long generation sequences
-## KV Events vs. Approximation Mode
-The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
- **With KV Events (default)**:
-  - Calculates overlap accurately using actual cached blocks
-  - Provides higher accuracy with event processing overhead
-  - Recommended for production deployments
- **Without KV Events (--no-kv-events)**:
-  - Router predicts cache state based on routing decisions with TTL-based expiration and pruning
-  - Tracks blocks from recent requests with configurable time-to-live
-  - Reduces overhead at the cost of routing accuracy
-  - **NATS is not needed** - suitable for simpler deployments without NATS infrastructure
-  - Suitable for testing or when event processing becomes a bottleneck
-## Event Transport Modes
-The router supports two event transport modes for KV cache state synchronization:
- **JetStream (default)**: Persistent event stream with durable consumers. State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency.
- **NATS Core with Local Indexer** (`--enable-local-indexer` on workers): Fire-and-forget pub/sub where workers maintain local radix trees. Router rebuilds state by querying workers on startup. Lower latency, simpler setup.
-See [KV Cache Routing](kv_cache_routing.md#global-kv-cache-state-synchronization) for architecture diagrams and details.
-## Disaggregated Serving
-Dynamo supports disaggregated serving where prefill and decode are handled by separate worker pools. Register prefill workers with `ModelType.Prefill` and the frontend automatically activates an internal prefill router.
-Key points:
- Prefill router auto-activates when both prefill and decode workers register with the same model name
- Supports vLLM and TensorRT-LLM backends (SGLang requires separate router setup)
- Use `--no-track-active-blocks` for prefill-only workers
-See [KV Cache Routing - Disaggregated Serving](kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for setup examples.
-## Router Replicas and State Persistence
-For high availability, run multiple router replicas with `--router-replica-sync` to synchronize active block tracking via NATS.
-State persistence options:
- **JetStream mode**: Automatic persistence via event stream and object store snapshots
- **Local Indexer mode**: State rebuilds from workers on startup
- **Reset state**: Use `--router-reset-states` to start fresh (use with caution)
-See [KV Cache Routing - Serving Multiple Router Replicas](kv_cache_routing.md#serving-multiple-router-replicas) for details.
-## Busy Thresholds
-Control worker saturation with busy thresholds:
- `--active-decode-blocks-threshold <0.0-1.0>`: Mark workers busy when KV cache utilization exceeds threshold
- `--active-prefill-tokens-threshold <count>`: Mark workers busy when active prefill tokens exceed threshold
-Thresholds can be updated at runtime via the `/busy_threshold` HTTP endpoint. See [Dynamic Threshold Configuration](kv_cache_routing.md#dynamic-threshold-configuration).
-## Python API
-For programmatic routing control, use the `KvPushRouter` class directly:
-```python
-from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
-router = KvPushRouter(endpoint=endpoint, block_size=16, kv_router_config=KvRouterConfig())
-stream = await router.generate(token_ids=tokens, model="model-name")
-```
-Key methods: `generate()`, `best_worker()`, `get_potential_loads()`, `mark_prefill_complete()`, `free()`.
-See [KV Cache Routing - Python API](kv_cache_routing.md#using-kvpushrouter-python-api) for complete examples.
 ## Prerequisites and Limitations
- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`
+**Requirements:**
- **No multimodal support**: Currently tracks token-based blocks only
+- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
- **No static endpoints**: Use `--router-mode round-robin` for static endpoint deployments
+- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md))
+- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
-See [KV Cache Routing - Prerequisites](kv_cache_routing.md#prerequisites-and-limitations) for details.
-## Tuning Guidelines
-### 1. Understand Your Workload Characteristics
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
-### 2. Monitor Key Metrics
-The router logs the cost calculation for each worker:
-```
-Formula for worker_1: 125.3 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
-```
-This shows:
+**Multimodal Support:**
- Total cost (125.3)
+- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
- Overlap weight × prefill blocks (1.0 × 100.5)
+- **SGLang**: Image routing not yet supported
- Active blocks (25.0)
+- **Other modalities** (audio, video, etc.): Not yet supported
- Cached blocks that contribute to overlap (15)
-### 3. Temperature-Based Routing
+**Limitations:**
+- Static endpoints are not supported: the KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
-The `router_temperature` parameter controls routing randomness:
+For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution
-### 4. Iterative Optimization
+## Next Steps
-1. Begin with default settings
+- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
-2. Monitor TTFT and ITL metrics
+- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
-3. Adjust `kv-overlap-score-weight` to meet your performance goals:
+- **[Router Design](../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
-   - To reduce TTFT: Increase the weight
-   - To reduce ITL: Decrease the weight
-4. If you observe severe load imbalance, increase the temperature setting
--- a/docs/router/router_examples.md
+++ b/docs/router/router_examples.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Router Examples
+For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
+## Table of Contents
+- [Using KvPushRouter Python API](#using-kvpushrouter-python-api)
+- [K8s Examples](#k8s-examples)
+- [Routing Patterns](#routing-patterns)
+- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
+- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
+## Using KvPushRouter Python API
+Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
+>[!Warning]
+> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
+### Methods
+The `KvPushRouter` provides the following methods:
+- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
+- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
+  - Without `request_id`: Query-only, doesn't update router state
+  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
+- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
+- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
+- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
+- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
+### Setup
+First, launch your backend engines:
+```bash
+python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
+```
+### Example Script
+```python
+import asyncio
+from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
+async def main():
+    # Get runtime and create endpoint
+    runtime = DistributedRuntime.detached()
+    namespace = runtime.namespace("dynamo")
+    component = namespace.component("backend")
+    endpoint = component.endpoint("generate")
+    # Create KV router
+    kv_router_config = KvRouterConfig()
+    router = KvPushRouter(
+        endpoint=endpoint,
+        block_size=16,
+        kv_router_config=kv_router_config
+    )
+    # Your input tokens
+    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+    # Generate with per-request routing override
+    stream = await router.generate(
+        token_ids=token_ids,
+        model="meta-llama/Llama-2-7b-hf",
+        stop_conditions={
+            "max_tokens": 20,        # Generate exactly 20 tokens
+            "ignore_eos": True,      # Don't stop at EOS token
+        },
+        sampling_options={
+            "temperature": 0.7,
+            "top_p": 0.9,
+        },
+        router_config_override={
+            "overlap_score_weight": 2.0,    # Prioritize cache hits for this request
+            "router_temperature": 0.5,       # Add routing randomness
+        }
+    )
+    # Collect generated tokens
+    generated_tokens = []
+    async for response in stream:
+        if isinstance(response, dict) and "token_ids" in response:
+            generated_tokens.extend(response["token_ids"])
+    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+## K8s Examples
+For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](README.md#kubernetes-deployment) in the Quick Start guide.
+### Complete K8s Examples
+- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml)
+- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml)
+- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml)
+- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
+**For A/B Testing and Advanced K8s Setup:**
+See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
+### Example with Advanced Configuration
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-namespace
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv
+        - name: DYN_ROUTER_TEMPERATURE
+          value: "0.5"  # Add some randomness to prevent worker saturation
+        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
+          value: "1.5"  # Prioritize TTFT over ITL
+        - name: DYN_KV_CACHE_BLOCK_SIZE
+          value: "16"
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+```
+### Alternative: Using Command Args in K8s
+You can also pass CLI arguments directly in the container command:
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+    command:
+      - /bin/sh
+      - -c
+    args:
+      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
+```
+**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
+## Routing Patterns
+The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
+### 1. Automatic Routing (Recommended)
+Call `generate()` directly and let the router handle everything:
+```python
+stream = await router.generate(token_ids=tokens, model="model-name")
+```
+- **Best for**: Most use cases
+- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
+### 2. Manual State Management (Advanced)
+Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
+```python
+worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
+# `client` stands in for your own client that sends the request to the selected worker
+response = await client.generate(tokens, request_id="req-123")
+# await anext(response)  # Get first token
+await router.mark_prefill_complete("req-123")  # After first token
+# async for _ in response:  # Continue generating
+#     ...
+await router.free("req-123")  # After completion
+```
+- **Best for**: Custom request handling with router state tracking
+- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
+- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
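One way to reduce the risk of leaked router state is to wrap the lifecycle in an async context manager so that `free()` runs even when request handling raises. This is a minimal sketch, assuming only the `best_worker()` and `free()` methods described above; `tracked_request` is a hypothetical helper, not part of the `KvPushRouter` API.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def tracked_request(router, token_ids, request_id):
    """Select a worker with state tracking and always release it on exit."""
    # Passing request_id makes the router track this request's load.
    worker_id, dp_rank, overlap_blocks = await router.best_worker(
        token_ids, request_id=request_id
    )
    try:
        yield worker_id, dp_rank, overlap_blocks
    finally:
        # Runs on success, error, or cancellation, so the router's
        # load tracking does not drift.
        await router.free(request_id)
```

Inside the `async with` body you would still call `router.mark_prefill_complete(request_id)` once the first token arrives, since only your request handling code knows when that happens.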
+### 3. Hierarchical Router Probing
+Query without state updates, then route through a chosen router:
+```python
+# Probe multiple routers without updating state (no request_id)
+worker_id_1, _dp_rank_1, overlap_1 = await router_1.best_worker(tokens)
+worker_id_2, _dp_rank_2, overlap_2 = await router_2.best_worker(tokens)
+# Pick the router reporting the larger cache overlap, along with its chosen worker
+if overlap_1 > overlap_2:
+    chosen_router, worker_id = router_1, worker_id_1
+else:
+    chosen_router, worker_id = router_2, worker_id_2
+stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
+```
+- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
+- **Advantage**: Query multiple routers before committing to one
+### 4. Custom Load-Based Routing
+Use `get_potential_loads()` to implement custom routing logic:
+```python
+loads = await router.get_potential_loads(tokens)
+# Apply custom logic (e.g., weighted scoring, constraints)
+best_worker = min(loads, key=lambda x: custom_cost_fn(x))
+stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
+```
+- **Best for**: Custom optimization strategies beyond the built-in cost function
+- **Advantage**: Full control over worker selection logic
+- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
+All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
+## Custom Routing Example: Minimizing TTFT
+Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
+```python
+import asyncio
+from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
+async def minimize_ttft_routing():
+    # Setup router
+    runtime = DistributedRuntime.detached()
+    namespace = runtime.namespace("dynamo")
+    component = namespace.component("backend")
+    endpoint = component.endpoint("generate")
+    router = KvPushRouter(
+        endpoint=endpoint,
+        block_size=16,
+        kv_router_config=KvRouterConfig()
+    )
+    # Your input tokens
+    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+    # Get potential loads for all workers
+    potential_loads = await router.get_potential_loads(token_ids)
+    # Find worker with minimum prefill tokens (best for TTFT)
+    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
+    print(f"Worker loads: {potential_loads}")
+    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")
+    # Route directly to the selected worker
+    stream = await router.generate(
+        token_ids=token_ids,
+        model="meta-llama/Llama-2-7b-hf",
+        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
+        stop_conditions={"max_tokens": 20}
+    )
+    # Process response
+    async for response in stream:
+        if isinstance(response, dict) and "token_ids" in response:
+            print(f"Generated tokens: {response['token_ids']}")
+if __name__ == "__main__":
+    asyncio.run(minimize_ttft_routing())
+```
+This approach gives you complete control over routing decisions, so you can optimize for different metrics based on your requirements. For example:
+- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
+- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
+- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
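The "balance load" option can be sketched as a custom cost over the dictionaries returned by `get_potential_loads()`. The keys `worker_id`, `potential_prefill_tokens`, and `potential_decode_blocks` are taken from the text above; the weighting scheme, the function names, and the sample numbers are assumptions for illustration.

```python
def balanced_cost(load: dict, prefill_weight: float = 1.0,
                  block_size: int = 16) -> float:
    """Combine prefill and decode load into a single comparable cost."""
    # Convert prefill tokens to blocks so both terms use the same unit.
    prefill_blocks = load["potential_prefill_tokens"] / block_size
    return prefill_weight * prefill_blocks + load["potential_decode_blocks"]


def pick_balanced_worker(loads: list, prefill_weight: float = 1.0,
                         block_size: int = 16) -> dict:
    """Return the load entry with the lowest combined cost."""
    return min(loads, key=lambda load: balanced_cost(load, prefill_weight, block_size))


# Hypothetical loads, shaped like the entries described above.
loads = [
    {"worker_id": 1, "potential_prefill_tokens": 160, "potential_decode_blocks": 40},
    {"worker_id": 2, "potential_prefill_tokens": 640, "potential_decode_blocks": 5},
]
print(pick_balanced_worker(loads)["worker_id"])
print(pick_balanced_worker(loads, prefill_weight=2.0)["worker_id"])
```

The selected entry's `worker_id` can then be passed to `router.generate(..., worker_id=...)` as in the TTFT example.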
+See [Router Design](../design_docs/router_design.md) for architecture details and the cost function algorithm.
+## KV Event Publishing for Custom Engines
+The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions. There are two main publishing pathways:
+- **Direct NATS publishing** (`KvEventPublisher`): publishes events directly to NATS and is the simplest approach for custom engines.
+- **ZMQ-based publishing**: for engines with ZMQ event output (like vLLM). It uses a ZMQ publisher in the engine and `ZmqKvEventPublisher` to forward events to NATS.
+### Event Types
+The KV cache supports three event types:
+| Event Type | Description | When to Publish |
+|------------|-------------|-----------------|
+| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
+| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
+| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
+### Event Structure
+Each event contains:
+- **`event_id`**: Monotonically increasing identifier per worker
+- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
+- **`data`**: One of `Stored`, `Removed`, or `Cleared`
+For `BlockStored` events:
+- **`token_ids`**: List of token IDs for the stored blocks
+- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
+- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
+- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
+- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
+For `BlockRemoved` events:
+- **`block_hashes`**: List of sequence block hashes being evicted
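To make the "cumulative" property of sequence block hashes concrete, the sketch below chains each block's hash to its parent's. In practice the hashes come from your engine's block manager. The BLAKE2 hash, the zero seed for the first block, and the dropping of a trailing partial block are all assumptions made for illustration.

```python
import hashlib


def sequence_block_hashes(token_ids: list, block_size: int) -> list:
    """Return one chained hash per full block of token_ids."""
    hashes = []
    parent = 0  # assumed seed: the first block has no parent
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        # Mixing in the parent hash makes each hash depend on the whole prefix.
        payload = parent.to_bytes(8, "little") + b"".join(
            t.to_bytes(8, "little") for t in block
        )
        parent = int.from_bytes(
            hashlib.blake2b(payload, digest_size=8).digest(), "little"
        )
        hashes.append(parent)
    return hashes


a = sequence_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
b = sequence_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
print(a[0] == b[0])  # shared first block, so the prefix matches
print(a[1] == b[1])  # sequences diverge in the second block
```

With this chaining, the `parent_hash` for a `BlockStored` event is the hash of the block that precedes the first block in the event, which is how two requests with the same prefix map to the same entries.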
+### Option 1: Direct NATS Publishing (Recommended)
+The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines.
+```mermaid
+flowchart LR
+    subgraph Engine["Custom Engine"]
+        cache["KV Cache Manager"]
+    end
+    subgraph Worker["Dynamo Worker Process"]
+        pub["KvEventPublisher"]
+    end
+    subgraph NATS["NATS"]
+        subject["kv-events subject"]
+    end
+    subgraph Router["KV Router"]
+        indexer["KvIndexer"]
+    end
+    cache -->|"on_blocks_stored()<br/>on_blocks_removed()"| pub
+    pub -->|"publish to NATS"| subject
+    subject --> indexer
+```
+**When to use:**
+- Building a custom inference engine from scratch
+- Your engine doesn't have a ZMQ-based event system
+- You want the simplest integration path
+#### Basic Setup
+```python
+from dynamo.llm import KvEventPublisher
+class CustomEnginePublisher:
+    def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0):
+        self.block_size = block_size
+        self.event_id = 0
+        self.kv_publisher = KvEventPublisher(
+            component=component,
+            worker_id=worker_id,
+            kv_block_size=block_size,
+            dp_rank=dp_rank,
+            enable_local_indexer=False,
+        )
+    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
+                         lora_id: int = 0, parent_hash: int | None = None):
+        """Call after KV cache blocks are allocated."""
+        self.event_id += 1
+        num_block_tokens = [self.block_size] * len(block_hashes)
+        self.kv_publisher.publish_stored(
+            event_id=self.event_id,
+            token_ids=token_ids,
+            num_block_tokens=num_block_tokens,
+            block_hashes=block_hashes,
+            lora_id=lora_id,
+            parent_hash=parent_hash,
+        )
+    def on_blocks_removed(self, block_hashes: list[int]):
+        """Call when KV cache blocks are evicted."""
+        self.event_id += 1
+        self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes)
+```
+#### Integration with Your Engine
+```python
+from dynamo.llm import register_llm
+async def main():
+    # Register your engine with Dynamo
+    component, endpoint = await register_llm(
+        model="my-model",
+        generator=my_generate_fn,
+    )
+    # Initialize publisher
+    publisher = CustomEnginePublisher(
+        component=component,
+        worker_id=endpoint.connection_id(),
+        block_size=16,  # Match your engine's block size
+    )
+    # Hook into your engine's cache events
+    def on_prefill_complete(request_id, token_ids, blocks):
+        block_hashes = [block.hash for block in blocks]
+        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)
+    def on_cache_eviction(evicted_blocks):
+        block_hashes = [block.hash for block in evicted_blocks]
+        publisher.on_blocks_removed(block_hashes=block_hashes)
+```
+### Option 2: ZMQ-based Publishing
+For engines that publish events via ZMQ (like vLLM), this option uses two components that work together:
+1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket
+2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS
+```mermaid
+flowchart LR
+    subgraph Engine["Custom Engine / vLLM"]
+        cache["KV Cache Manager"]
+        zmq_pub["ZMQ Publisher<br/>(Pure Python)"]
+    end
+    subgraph ZMQ["ZMQ Socket"]
+        socket["tcp://127.0.0.1:5557"]
+    end
+    subgraph Worker["Dynamo Worker Process"]
+        zmq_sub["ZmqKvEventPublisher<br/>(Rust bindings)"]
+    end
+    subgraph NATS["NATS"]
+        subject["kv-events subject"]
+    end
+    subgraph Router["KV Router"]
+        indexer["KvIndexer"]
+    end
+    cache --> zmq_pub
+    zmq_pub -->|"PUB"| socket
+    socket -->|"SUB"| zmq_sub
+    zmq_sub --> subject
+    subject --> indexer
+```
+**When to use:**
+- Your engine already has a ZMQ-based event system (like vLLM)
+- You're integrating with a consolidator (like KVBM)
+- You want to decouple event publishing from your engine's main loop
+#### Part 1: ZMQ Subscriber (Dynamo Bindings)
+If your engine already publishes to ZMQ, use `ZmqKvEventPublisher` to subscribe and forward to NATS:
+```python
+from dynamo.llm import ZmqKvEventPublisher, ZmqKvEventPublisherConfig
+# Configure the ZMQ subscriber
+config = ZmqKvEventPublisherConfig(
+    worker_id=endpoint.connection_id(),
+    kv_block_size=block_size,
+    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
+    zmq_topic="",                          # Subscribe to all topics
+    enable_local_indexer=False,
+)
+# Create publisher - it automatically subscribes to ZMQ and forwards to NATS
+kv_publisher = ZmqKvEventPublisher(
+    component=component,
+    config=config,
+)
+```
+#### Part 2: ZMQ Publisher (Pure Python)
+If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol:
+```python
+import zmq
+import msgpack
+import time
+class ZmqKvEventPublisher:
+    """Pure Python ZMQ publisher for KV events (vLLM-compatible format)."""
+    def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""):
+        self.kv_block_size = kv_block_size
+        self.topic = topic
+        self.ctx = zmq.Context()
+        self.socket = self.ctx.socket(zmq.PUB)
+        self.socket.bind(zmq_endpoint)
+        self.sequence = 0
+        self.data_parallel_rank = 0
+    def _to_signed_i64(self, value: int | None) -> int | None:
+        if value is None:
+            return None
+        return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value
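+    def publish_stored(self, event_id: int, token_ids: list[int], num_block_tokens: list[int],
+                       block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None):
+        # event_id and num_block_tokens are not written to the wire event; the message
+        # carries the ZMQ sequence number and a single block_size instead.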
+        event = {
+            "type": "BlockStored",
+            "block_hashes": [self._to_signed_i64(h) for h in block_hashes],
+            "parent_block_hash": self._to_signed_i64(parent_hash),
+            "token_ids": token_ids,
+            "block_size": self.kv_block_size,
+            "lora_id": lora_id if lora_id != 0 else None,
+        }
+        self._publish_event(event)
+    def publish_removed(self, event_id: int, block_hashes: list[int]):
+        event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]}
+        self._publish_event(event)
+    def publish_all_cleared(self):
+        self._publish_event({"type": "AllBlocksCleared"})
+    def _publish_event(self, event: dict):
+        batch = [time.time(), [event], self.data_parallel_rank]
+        payload = msgpack.packb(batch, use_bin_type=True)
+        sequence_bytes = self.sequence.to_bytes(8, byteorder="big")
+        self.sequence += 1
+        self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload])
+    def shutdown(self):
+        self.socket.close()
+        self.ctx.term()
+```
+### ZMQ Wire Format
+The ZMQ message format (compatible with vLLM):
+| Frame | Description |
+|-------|-------------|
+| 1 | Topic (empty string for all topics) |
+| 2 | Sequence number (8 bytes, big-endian) |
+| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |
+Each event in the payload is a dictionary with a `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
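+On the receiving side, the three frames can be taken apart as in the sketch below. It uses only the standard library and leaves the msgpack payload undecoded. The helper names (`split_frames`, `to_unsigned_u64`) are hypothetical and only mirror the framing described in the table above.
+```python
+def to_signed_i64(value):
+    """Map an unsigned 64-bit hash onto the signed range used on the wire."""
+    return value - (1 << 64) if value > 0x7FFFFFFFFFFFFFFF else value
+def to_unsigned_u64(value):
+    """Inverse of to_signed_i64: recover the unsigned 64-bit hash."""
+    return value + (1 << 64) if value < 0 else value
+def split_frames(frames):
+    """Split a 3-frame message into (topic, sequence, payload_bytes)."""
+    if len(frames) != 3:
+        raise ValueError(f"expected 3 frames, got {len(frames)}")
+    topic, seq_bytes, payload = frames
+    if len(seq_bytes) != 8:
+        raise ValueError("sequence frame must be 8 bytes")
+    return topic.decode(), int.from_bytes(seq_bytes, byteorder="big"), payload
+# A stand-in message: empty topic, sequence number 41, opaque payload bytes
+frames = [b"", (41).to_bytes(8, byteorder="big"), b"<msgpack bytes>"]
+topic, sequence, payload = split_frames(frames)
+big_hash = 0xFFFFFFFFFFFFFFFF
+```
+A gap between consecutive sequence numbers would indicate that a subscriber missed at least one message.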
+### Best Practices
+1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter)
+2. **Block size must match** your engine's actual `kv_block_size`
+3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
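+A thread-safe counter for event IDs could look like the sketch below. `EventIdCounter` is a hypothetical helper, not a Dynamo class; it only shows one way to hand out strictly increasing IDs when several threads report cache events.
+```python
+import threading
+class EventIdCounter:
+    """Hands out strictly increasing event IDs, safe to call from many threads."""
+    def __init__(self, start=0):
+        self._value = start
+        self._lock = threading.Lock()
+    def next(self):
+        # The lock makes increment-and-read a single atomic step.
+        with self._lock:
+            self._value += 1
+            return self._value
+counter = EventIdCounter()
+seen = []
+seen_lock = threading.Lock()
+def worker():
+    for _ in range(1000):
+        event_id = counter.next()
+        with seen_lock:
+            seen.append(event_id)
+threads = [threading.Thread(target=worker) for _ in range(4)]
+for t in threads:
+    t.start()
+for t in threads:
+    t.join()
+```
+The counter guarantees unique, increasing IDs at allocation time. If events must also reach the publisher in ID order, allocating the ID and publishing would need to happen under the same lock.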
+## See Also
+- **[Router README](README.md)**: Quick start guide for the KV Router
+- **[Router Guide](router_guide.md)**: Configuration, tuning, and production setup
+- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
--- a/docs/router/kv_cache_routing.md
+++ b/docs/router/kv_cache_routing.md
@@ -3,11 +3,63 @@ SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
 SPDX-License-Identifier: Apache-2.0
 -->
-# KV Cache Routing
+# Router Guide
-This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
-To enable KV cache aware routing start the frontend node like this:
+## Overview
+For quick start instructions, start with the [Router README](README.md). This guide goes further, covering detailed configuration, disaggregated serving setup, and parameter tuning.
+## KV Cache Routing
+KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
+```mermaid
+graph TD
+    T[Tokens] --> R[KV Aware Router]
+    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
+    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
+    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
+    style T fill:#fff3e0,stroke:#333,color:#333
+    style R fill:#2e8b57,stroke:#333,color:#fff
+    style W1 fill:#f3e5f5,stroke:#333,color:#333
+    style W2 fill:#c8e6c9,stroke:#333,color:#333
+    style W3 fill:#f3e5f5,stroke:#333,color:#333
+    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
 ```
+KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
+- Missed cache reuse opportunities due to suboptimal worker selection
+- System throughput degradation from uneven request distribution across workers
+The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.
+### Cost Calculation
+1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
+2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
+3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
+   - Lower costs indicate better routing choices
+   - `overlap_score_weight` balances cache hit optimization against load distribution
+   - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
+### Worker Selection
+The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
+Example calculation with `overlap_score_weight = 1.0`:
+- Worker 1: cost = 1.0 * 8 + 10 = 18
+- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
+- Worker 3: cost = 1.0 * 2 + 9 = 11
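+The example calculation can be reproduced with a few lines of Python. This is a sketch of the formula only, not the router's implementation; `routing_cost` and the worker names are illustrative.
+```python
+def routing_cost(prefill_blocks, decode_blocks, overlap_score_weight=1.0):
+    """cost = overlap_score_weight * prefill_blocks + decode_blocks"""
+    return overlap_score_weight * prefill_blocks + decode_blocks
+# Block counts taken from the diagram above
+workers = {
+    "worker_1": {"prefill_blocks": 8, "decode_blocks": 10},
+    "worker_2": {"prefill_blocks": 5, "decode_blocks": 5},
+    "worker_3": {"prefill_blocks": 2, "decode_blocks": 9},
+}
+costs = {name: routing_cost(**load) for name, load in workers.items()}
+selected = min(costs, key=costs.get)
+# Doubling the weight shifts the choice toward the worker with the most cache reuse
+costs_w2 = {name: routing_cost(**load, overlap_score_weight=2.0)
+            for name, load in workers.items()}
+selected_w2 = min(costs_w2, key=costs_w2.get)
+```
+With a weight of 1.0 the lowest cost belongs to Worker 2. With a weight of 2.0 the costs become 26, 15, and 13, so Worker 3, which needs the fewest prefill blocks, is selected instead.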
+### Using the KV Cache Router
+To enable KV cache-aware routing, start the frontend node like this:
+```bash
 python -m dynamo.frontend --router-mode kv
 ```
@@ -63,23 +115,67 @@ The main KV-aware routing arguments:
 >
 > The cli args `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
-## Prerequisites and Limitations
+To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md).
->[!Note]
+## Basic Routing
-> **KV Router Requirements**: The KV router currently works only with **dynamic endpoints** that are registered via [`register_llm()`](../development/backend-guide.md#writing-python-workers-in-dynamo) with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
+Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
+First, create a client tied to a component's endpoint using the labels defined above. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.
+```python
+client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
+```
-**Current Limitations (WIP):**
+We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
-- **Static endpoints**: Not yet supported. The KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states.
-- **Multimodal models**: Not yet supported. The KV router currently tracks token-based blocks only.
+- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
+- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
+- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
+KV Cache routing uses direct routing with a special worker selection algorithm.
+For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md).
+For custom routing logic and advanced patterns, see [Routing Patterns](router_examples.md#routing-patterns) in the examples documentation.
+## Tuning Guidelines
+### 1. Understand Your Workload Characteristics
+- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
+- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
+### 2. Monitor Key Metrics
+The router logs the cost calculation for each worker:
+```text
+Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
+```
-**What this means for your setup:**
+This shows:
-1. Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md) or [example implementations](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/hello_world))
+- Total cost (125.5)
-2. Your handler receives requests with pre-tokenized `token_ids`, not raw text or multimodal inputs
+- Overlap weight × prefill blocks (1.0 × 100.5)
-3. You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
+- Active blocks (25.0)
+- Cached blocks that contribute to overlap (15)
-For basic model registration without KV routing, you can use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
+### 3. Temperature-Based Routing
-## Disaggregated Serving (Prefill and Decode)
+The `router_temperature` parameter controls routing randomness:
+- **0.0 (default)**: Deterministic selection of the best worker
+- **> 0.0**: Probabilistic selection, higher values increase randomness
+- Useful for preventing worker saturation and improving load distribution
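+The effect of temperature can be sketched as softmax sampling over negated, normalized costs. The min-max normalization used here is an assumption made for illustration; the router's exact normalization may differ, and the function names are hypothetical.
+```python
+import math
+import random
+def selection_probabilities(costs, temperature):
+    """Return a probability per worker; lower cost means higher probability."""
+    names = list(costs)
+    if temperature == 0.0:
+        # Deterministic: all probability mass on the lowest-cost worker.
+        best = min(names, key=costs.get)
+        return {n: 1.0 if n == best else 0.0 for n in names}
+    lo, hi = min(costs.values()), max(costs.values())
+    span = (hi - lo) or 1.0  # avoid division by zero when all costs are equal
+    # Assumed normalization: scale costs to [0, 1], negate, divide by temperature.
+    logits = [-(costs[n] - lo) / span / temperature for n in names]
+    peak = max(logits)
+    weights = [math.exp(value - peak) for value in logits]
+    total = sum(weights)
+    return {n: w / total for n, w in zip(names, weights)}
+def pick_worker(costs, temperature, rng=random):
+    """Sample one worker according to the softmax probabilities."""
+    probs = selection_probabilities(costs, temperature)
+    names = list(probs)
+    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
+costs = {"worker_1": 18.0, "worker_2": 10.0, "worker_3": 11.0}
+greedy = selection_probabilities(costs, temperature=0.0)
+warm = selection_probabilities(costs, temperature=0.5)
+hot = selection_probabilities(costs, temperature=5.0)
+```
+As the temperature rises, the probabilities flatten: the best worker is still the most likely choice, but other workers receive a growing share of requests.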
+### 4. Iterative Optimization
+1. Begin with default settings
+2. Monitor TTFT and ITL metrics
+3. Adjust `kv-overlap-score-weight` to meet your performance goals:
+   - To reduce TTFT: Increase the weight
+   - To reduce ITL: Decrease the weight
+4. If you observe severe load imbalance, increase the temperature setting
+## Disaggregated Serving
 Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
@@ -158,190 +254,13 @@ graph TD
    linkStyle 5 stroke:#2196f3,stroke-width:2px
 ```
-## Overview
-The KV-aware router operates on two key principles to optimize request routing:
-### Global KV Cache State Synchronization
-KV events from engines are collected by the router to maintain a global view of cached blocks across all workers. The router supports two event transport modes:
-#### Mode 1: JetStream (Default)
-KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
-- **Best for**: Production deployments requiring durability and multi-replica router consistency
-- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees
-```mermaid
-graph TD
-    subgraph Engines
-        E1[Engine 1<br/>KVPublisher]
-        E2[Engine 2<br/>KVPublisher]
-        E3[Engine 3<br/>KVPublisher]
-    end
-    subgraph "NATS JetStream"
-        JS[(Persistent KV Events Stream<br/>- Block created<br/>- Block removed)]
-    end
-    subgraph "NATS Object Store"
-        OS[(Radix Tree<br/>State Snapshot)]
-    end
-    subgraph "Router Replicas"
-        R1[Router 1<br/>KVIndexer]
-        R2[Router 2<br/>KVIndexer]
-    end
-    E1 -->|Publish Events| JS
-    E2 -->|Publish Events| JS
-    E3 -->|Publish Events| JS
-    JS -->|Consume as Durable Consumer| R1
-    JS -->|Consume as Durable Consumer| R2
-    JS -->|Periodic Snapshot| OS
-    style JS fill:#e1f5fe,stroke:#333,color:#333
-    style OS fill:#e1f5fe,stroke:#333,color:#333
-    style E1 fill:#f3e5f5,stroke:#333,color:#333
-    style E2 fill:#f3e5f5,stroke:#333,color:#333
-    style E3 fill:#f3e5f5,stroke:#333,color:#333
-    style R1 fill:#2e8b57,stroke:#333,color:#fff
-    style R2 fill:#2e8b57,stroke:#333,color:#fff
-    linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
-```
-#### Mode 2: NATS Core with Local Indexer
-When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly.
-- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios
-- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
-- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker)
-```mermaid
-graph TD
-    subgraph Engines
-        E1[Engine 1<br/>LocalKvIndexer]
-        E2[Engine 2<br/>LocalKvIndexer]
-        E3[Engine 3<br/>LocalKvIndexer]
-    end
-    subgraph "NATS Core"
-        NC[KV Events Pub/Sub<br/>- Block created<br/>- Block removed]
-    end
-    subgraph "Router Replicas"
-        R1[Router 1<br/>KVIndexer]
-        R2[Router 2<br/>KVIndexer]
-    end
-    E1 -->|Publish Events| NC
-    E2 -->|Publish Events| NC
-    E3 -->|Publish Events| NC
-    NC -->|Subscribe| R1
-    NC -->|Subscribe| R2
-    style NC fill:#e1f5fe,stroke:#333,color:#333
-    style E1 fill:#f3e5f5,stroke:#333,color:#333
-    style E2 fill:#f3e5f5,stroke:#333,color:#333
-    style E3 fill:#f3e5f5,stroke:#333,color:#333
-    style R1 fill:#2e8b57,stroke:#333,color:#fff
-    style R2 fill:#2e8b57,stroke:#333,color:#fff
-    linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
-```
-**How gap detection works:**
-1. Each worker assigns monotonically increasing event IDs starting from 0
-2. The router tracks the last received event ID per worker
-3. If an event arrives with `event_id > last_id + 1`, the router detects a gap
-4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]`
-5. On worker discovery (Added event), the router dumps the worker's entire local indexer state
-**Startup behavior:**
-- When a worker is discovered, the router queries and ingests its full local indexer state
-- When a worker is removed, the router removes all its blocks from the global radix tree
->[!Note]
-> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode.
-### Local Active Block Management with Replica Sync
-Second, in addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
-- The router receives and routes a request
-- The first token is generated (prefill complete)
-- The response ends (request freed)
-This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
-```mermaid
-sequenceDiagram
-    participant C1 as Client 1
-    participant R1 as Router 1<br/>(Slot Manager)
-    participant R2 as Router 2<br/>(Slot Manager)
-    participant C2 as Client 2
-    Note over R1,R2: Router Replica Sync Enabled
-    C1->>R1: Request A
-    activate R1
-    R1->>R1: Predict blocks & route to worker
-    R1-->>R2: Sync: AddRequest(A)
-    C2->>R2: Request B
-    activate R2
-    R2->>R2: Predict blocks & route to worker
-    R2-->>R1: Sync: AddRequest(B)
-    R1->>R1: First token received<br/>(prefill complete)
-    R1-->>R2: Sync: MarkPrefillCompleted(A)
-    R1->>C1: Stream response
-    R2->>R2: First token received<br/>(prefill complete)
-    R2-->>R1: Sync: MarkPrefillCompleted(B)
-    R2->>C2: Stream response
-    R1->>R1: Response complete<br/>(free blocks)
-    R1-->>R2: Sync: Free(A)
-    deactivate R1
-    R2->>R2: Response complete<br/>(free blocks)
-    R2-->>R1: Sync: Free(B)
-    deactivate R2
-    Note over R1,R2: Both routers have consistent<br/>view of active blocks
-```
-This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.
-## Basic Routing
-Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
-First, we must create a client tied to a components endpoint, we can do this using the labels defined above. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
-```python
-client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
-```
-We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
-- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
-- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
-- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
-KV Cache routing uses direct routing with a special worker selection algorithm.
 ## Serving Multiple Router Replicas
 For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.)
 ### Router State Management
-The KV Router tracks two types of state (see [KV Router Architecture](../router/README.md) for details):
+The KV Router tracks two types of state (see [Router Design](../design_docs/router_design.md) for details):
 1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.
@@ -389,318 +308,10 @@ python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync
 > 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](/docs/design_docs/distributed_runtime.md)) which will start a new stream and NATS object store path
 > 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
-## Understanding KV Cache
-The leading Large Language Models (LLMs) today are auto-regressive and based off of the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization technique is to cache the already computed keys and values and to reuse them for the future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).
-### KV Cache Optimizations
-Every inference framework will have a KV Cache for each worker. A popular inference framework library is [vLLM](https://github.com/vllm-project/vllm) where a key contribution was [PagedAttention](https://arxiv.org/abs/2309.06180), which allowed them to manage KV Cache in an efficient way by chunking requests into blocks.
-Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104) which introduced a
-prefix tree which allows for efficient matching, inserting and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
-In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally.
-To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
-1. Request tokenization: The incoming prompt is converted into tokens
-2. Block partitioning: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
-3. Block hashing: Each block of tokens is hashed to create a unique identifier
-4. Cache lookup:
-    - For each block, the system checks if a matching block already exists in the KV cache
-    - If a match is found, the existing KV cache block is reused
-    - If no match is found, the system proceeds to the next step
-5. Resource allocation:
-    - For blocks without matches, the system attempts to allocate new memory space
-    - If sufficient memory is available, allocate memory space and proceed to step 7
-    - If memory is constrained, proceed to step 6
-6. Cache eviction (when necessary):
-    - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
-    - Selected blocks are evicted from the cache
-    - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
-    - Alternatively, some systems may offload less-frequently used blocks to CPU memory.
-7. KV computation:
-    - For new blocks, the model computes key and value tensors
-    - These tensors are stored in the newly allocated cache blocks
-    - **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**.
-Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
-## KV Cache Routing and Load Balancing
-```mermaid
-graph TD
-    T[Tokens] --> R[KV Aware Router]
-    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
-    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
-    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
-    style T fill:#fff3e0,stroke:#333,color:#333
-    style R fill:#2e8b57,stroke:#333,color:#fff
-    style W1 fill:#f3e5f5,stroke:#333,color:#333
-    style W2 fill:#c8e6c9,stroke:#333,color:#333
-    style W3 fill:#f3e5f5,stroke:#333,color:#333
-    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
-```
-KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
-- Missed cache reuse opportunities due to suboptimal worker selection
-- System throughput degradation from uneven request distribution across workers
-The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
-### Cost Calculation
-1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
-2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
-3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
-   - Lower costs indicate better routing choices
-   - `overlap_score_weight` balances cache hit optimization against load distribution
-   - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
-### Worker Selection
-The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
-Example calculation with `overlap_score_weight = 1.0`:
-- Worker 1: cost = 1.0 * 8 + 10 = 18
-- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
-- Worker 3: cost = 1.0 * 2 + 9 = 11
-## Events
-### KVPublisher
-The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
-The two types of events are:
-- KV stored event
-- KV removed event
-The publisher can be initialized and used through C bindings or Python bindings.
-### Deterministic Event IDs
-Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect.
-### KVIndexer
-The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
-The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
-### Inter-Router Communication
-In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
-1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
-2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
-3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
-Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
-## Using KvPushRouter Python API
-Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
->[!Warning]
-> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
-### Methods
-The `KvPushRouter` provides the following methods:
-- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
-- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
-  - Without `request_id`: Query-only, doesn't update router state
-  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
-- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
-- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
-- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
-- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
-### Setup
-First, launch your backend engines:
-```bash
-python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
-```
-### Example Script
-```python
-import asyncio
-from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
-async def main():
-    # Get runtime and create endpoint
-    runtime = DistributedRuntime.detached()
-    namespace = runtime.namespace("dynamo")
-    component = namespace.component("backend")
-    endpoint = component.endpoint("generate")
-    # Create KV router
-    kv_router_config = KvRouterConfig()
-    router = KvPushRouter(
-        endpoint=endpoint,
-        block_size=16,
-        kv_router_config=kv_router_config
-    )
-    # Your input tokens
-    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-    # Generate with per-request routing override
-    stream = await router.generate(
-        token_ids=token_ids,
-        model="meta-llama/Llama-2-7b-hf",
-        stop_conditions={
-            "max_tokens": 20,        # Generate exactly 20 tokens
-            "ignore_eos": True,      # Don't stop at EOS token
-        },
-        sampling_options={
-            "temperature": 0.7,
-            "top_p": 0.9,
-        },
-        router_config_override={
-            "overlap_score_weight": 2.0,    # Prioritize cache hits for this request
-            "router_temperature": 0.5,       # Add routing randomness
-        }
-    )
-    # Collect generated tokens
-    generated_tokens = []
-    async for response in stream:
-        if isinstance(response, dict) and "token_ids" in response:
-            generated_tokens.extend(response["token_ids"])
-    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-### Routing Patterns
-The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
-#### 1. Automatic Routing (Recommended)
-Call `generate()` directly and let the router handle everything:
-```python
-stream = await router.generate(token_ids=tokens, model="model-name")
-```
-- **Best for**: Most use cases
-- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
-#### 2. Manual State Management (Advanced)
-Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
-```python
-worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
-response = await client.generate(tokens, request_id="req-123")
-# await anext(response)  # Get first token
-await router.mark_prefill_complete("req-123")  # After first token
-# async for _ in response:  # Continue generating
-#     ...
-await router.free("req-123")  # After completion
-```
-- **Best for**: Custom request handling with router state tracking
-- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
-- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
-#### 3. Hierarchical Router Probing
-Query without state updates, then route through a chosen router:
-```python
-# Probe multiple routers without updating state
-worker_id_1, _dp_rank_1, overlap_1 = await router_1.best_worker(tokens)  # No request_id
-worker_id_2, _dp_rank_2, overlap_2 = await router_2.best_worker(tokens)
-# Pick the router (and the worker it suggested) with the larger overlap
-if overlap_1 > overlap_2:
-    chosen_router, worker_id = router_1, worker_id_1
-else:
-    chosen_router, worker_id = router_2, worker_id_2
-stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
-```
-- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
-- **Advantage**: Query multiple routers before committing to one
-#### 4. Custom Load-Based Routing
-Use `get_potential_loads()` to implement custom routing logic:
-```python
-loads = await router.get_potential_loads(tokens)
-# Apply custom logic (e.g., weighted scoring, constraints)
-best_worker = min(loads, key=lambda x: custom_cost_fn(x))
-stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
-```
-- **Best for**: Custom optimization strategies beyond the built-in cost function
-- **Advantage**: Full control over worker selection logic
-- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
-All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
-### Custom Routing Example: Minimizing TTFT
-Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
-```python
-import asyncio
-from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
-async def minimize_ttft_routing():
-    # Setup router
-    runtime = DistributedRuntime.detached()
-    namespace = runtime.namespace("dynamo")
-    component = namespace.component("backend")
-    endpoint = component.endpoint("generate")
-    router = KvPushRouter(
-        endpoint=endpoint,
-        block_size=16,
-        kv_router_config=KvRouterConfig()
-    )
-    # Your input tokens
-    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-    # Get potential loads for all workers
-    potential_loads = await router.get_potential_loads(token_ids)
-    # Find worker with minimum prefill tokens (best for TTFT)
-    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
-    print(f"Worker loads: {potential_loads}")
-    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")
-    # Route directly to the selected worker
-    stream = await router.generate(
-        token_ids=token_ids,
-        model="meta-llama/Llama-2-7b-hf",
-        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
-        stop_conditions={"max_tokens": 20}
-    )
-    # Process response
-    async for response in stream:
-        if isinstance(response, dict) and "token_ids" in response:
-            print(f"Generated tokens: {response['token_ids']}")
-if __name__ == "__main__":
-    asyncio.run(minimize_ttft_routing())
-```
-This approach gives you complete control over routing decisions, so you can optimize for whichever metric matters most to your workload. For example:
-- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
-- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
-- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
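The "balance load" option can be sketched as a small cost function over the load dictionaries. This is an illustration only: `pick_balanced_worker`, the `decode_weight` parameter, and the conversion of prefill tokens to blocks are assumptions, and the weighting is not the router's built-in cost function. It assumes each entry carries `worker_id`, `potential_prefill_tokens`, and `potential_decode_blocks`, as named in the text.

```python
def pick_balanced_worker(loads, block_size=16, decode_weight=1.0):
    """Return the load entry with the lowest combined prefill + decode cost.

    Hypothetical helper: expresses prefill work in blocks so it is comparable
    to decode blocks, then adds the two with a tunable weight.
    """
    def cost(load):
        prefill_blocks = load["potential_prefill_tokens"] / block_size
        return prefill_blocks + decode_weight * load["potential_decode_blocks"]

    return min(loads, key=cost)


# Example input shaped like the list returned by get_potential_loads()
sample_loads = [
    {"worker_id": 0, "potential_prefill_tokens": 160, "potential_decode_blocks": 2},
    {"worker_id": 1, "potential_prefill_tokens": 32, "potential_decode_blocks": 20},
]
print(pick_balanced_worker(sample_loads)["worker_id"])
```

Setting `decode_weight=0.0` reduces this to the TTFT-only selection, since only prefill work is then counted.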
-See [KV Router Architecture](../router/README.md) for performance tuning details.
 ## Dynamic Threshold Configuration
+Dynamic threshold configuration lets you tune load-balancing behavior in real time, based on observed system performance.
 The busy thresholds can be updated at runtime without restarting the frontend. The frontend exposes HTTP endpoints at `/busy_threshold`:
 **Get or set a model's thresholds (POST):**
@@ -730,3 +341,10 @@ curl -X POST http://localhost:8000/busy_threshold \
 curl http://localhost:8000/busy_threshold
 # Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
 ```
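For scripting against this endpoint, the GET response shown above can be parsed with a few lines of Python. This is a minimal sketch that assumes the response keeps the shape in the example (`thresholds` list with `model`, `active_decode_blocks_threshold`, and `active_prefill_tokens_threshold` keys); `thresholds_for` is a hypothetical helper name.

```python
import json


def thresholds_for(response_text, model):
    """Return the threshold entry for `model`, or None if it is not listed."""
    payload = json.loads(response_text)
    for entry in payload.get("thresholds", []):
        if entry.get("model") == model:
            return entry
    return None


# Response body copied from the curl example above
example = (
    '{"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", '
    '"active_decode_blocks_threshold": 0.85, '
    '"active_prefill_tokens_threshold": 1000}]}'
)
entry = thresholds_for(example, "meta-llama/Llama-2-7b-hf")
print(entry["active_decode_blocks_threshold"])
```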
+## See Also
+- **[Router README](README.md)**: Quick start guide for the KV Router
+- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
+- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
+- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
--- a/docs/templates/MIGRATION_GUIDE.md
+++ b/docs/templates/MIGRATION_GUIDE.md
@@ -130,6 +130,52 @@ Check `docs/_includes/` for includes:
 ---
+## Pre-Migration Link Validation
+Before migrating, validate source docs to avoid carrying over broken links.
+### Pre-flight Broken Link Check
+```bash
+# Install lychee (if not available)
+cargo install lychee   # or: brew install lychee
+# Check source files (example: migrating kvbm docs)
+lychee docs/kvbm/ --offline --exclude-path docs/_build
+# Or use the full check with external URLs
+lychee docs/kvbm/ --exclude-path docs/_build
+```
+If lychee is unavailable, use ripgrep to find potentially broken links:
+```bash
+# Find all internal markdown links and spot-check targets
+# (-P enables PCRE2 so the negative lookahead can skip http/https URLs)
+rg -nP '\]\((?!https?://)[^)]*\.md' docs/kvbm/
+```
+### Golden Rule
+**Only link to files that exist.** Before adding any link:
+1. Verify the target file exists at the expected path
+2. Test the relative path calculation (count `../` correctly)
+3. For cross-section links, consider using the cross-reference path table
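The first two checks in this list can be automated with a short script. This is a rough sketch, not part of the migration tooling: `broken_links` is a hypothetical helper, and its regular expression only handles simple inline links of the form `[text](target)`.

```python
import re
from pathlib import Path

# Matches the target of a simple inline markdown link: [text](target)
LINK_RE = re.compile(r"\]\(([^)\s]+)\)")
# Matches a URL scheme prefix such as "https:" or "mailto:"
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_links(markdown_file):
    """Return relative link targets in `markdown_file` that do not exist on disk."""
    md = Path(markdown_file)
    missing = []
    for target in LINK_RE.findall(md.read_text(encoding="utf-8")):
        # Skip external URLs and same-page anchors.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        # Drop any "#fragment" and resolve relative to the file's directory.
        path = target.split("#", 1)[0]
        if path and not (md.parent / path).exists():
            missing.append(target)
    return missing
```

Calling `broken_links("docs/kvbm/README.md")` from the repository root would list the targets to fix before migrating that file; the path is only an example.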
+### Post-Migration Validation
+After moving files, run link check again to catch broken references:
+```bash
+# Check all docs after migration
+lychee docs/ --offline --exclude-path docs/_build
+# Check specific migrated directory (example: after moving to components/kvbm)
+lychee docs/components/kvbm/ --offline
+```
+---
 ## Style Editing Guidelines
 After migrating content, review for FLOW, STYLE, and CONSISTENCY.

--- a/dynamo.code-workspace
+++ b/dynamo.code-workspace
@@ -2,6 +2,9 @@
    "folders": [
        {
            "path": "."
+        },
+        {
+            "path": "../dynamo-tpm"
        }
    ],
    "settings": {

--- a/examples/backends/trtllm/deploy/README.md
+++ b/examples/backends/trtllm/deploy/README.md
@@ -266,7 +266,7 @@ Configure the `model` name and `host` based on your deployment.
 - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
 - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
-- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/kv_cache_routing.md)
+- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md)
 - **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
 - **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)