docs: fix typos and wrong imports in router docs (#6654)

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Unverified commit 002c6823, authored Feb 26, 2026 by Yan Ru Pei; committed by GitHub Feb 26, 2026
3 changed files
--- a/docs/pages/components/router/README.md
+++ b/docs/pages/components/router/README.md
@@ -31,7 +31,7 @@ Backend workers register themselves using the `register_model` API, after which
 | `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
 | `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
 | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
-| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT). |
+| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |

 For all available options: `python -m dynamo.frontend --help`
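
Reviewer note: the `--router-temperature` row is easier to reason about with a concrete sketch. The router-guide.md hunk in this commit describes the flag as softmax sampling over router cost logits, with 0 selecting the lowest-cost worker deterministically. The snippet below is a minimal illustration of that idea only, not the router's code: the `pick_worker` helper, the cost values, and the choice to negate and shift costs before exponentiating are all assumptions.

```python
import math
import random


def pick_worker(costs, temperature=0.0, rng=random):
    """Pick a worker id from {worker_id: cost}; lower cost is better.

    Hypothetical helper for illustration; not part of the dynamo API.
    """
    if temperature <= 0.0:
        # Deterministic case: always return the lowest-cost worker.
        return min(costs, key=costs.get)
    # Softmax over negated costs, so cheaper workers get higher probability.
    # Subtracting the minimum cost keeps exp() numerically stable.
    lowest = min(costs.values())
    weights = {w: math.exp(-(c - lowest) / temperature) for w, c in costs.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]


# Made-up costs for three workers.
costs = {"worker-a": 12.0, "worker-b": 9.5, "worker-c": 20.0}
chosen_cold = pick_worker(costs, temperature=0.0)
chosen_warm = pick_worker(costs, temperature=5.0, rng=random.Random(0))
```

With `temperature=0.0` the result is always the cheapest worker. Raising the temperature flattens the weights, so more expensive workers are picked more often.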


--- a/docs/pages/components/router/router-examples.md
+++ b/docs/pages/components/router/router-examples.md
@@ -51,11 +51,13 @@ python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf

 ```python
 import asyncio
-from dynamollm import DistributedRuntime, KvRouter, KvRouterConfig
+from dynamo.runtime import DistributedRuntime
+from dynamo.llm import KvRouter, KvRouterConfig

 async def main():
    # Get runtime and create endpoint
-    runtime = DistributedRuntime.detached()
+    loop = asyncio.get_running_loop()
+    runtime = DistributedRuntime(loop, "etcd", "nats")
    endpoint = runtime.endpoint("dynamo.backend.generate")

    # Create KV router
@@ -191,9 +193,12 @@ Query without state updates, then route through a chosen router:
 worker_id_1, dp_rank, overlap_1 = await router_1.best_worker(tokens)  # No request_id
 worker_id_2, dp_rank, overlap_2 = await router_2.best_worker(tokens)

-# Pick the best router based on results
-chosen_router = router_1 if overlap_1 > overlap_2 else router_2
-stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
+# Pick the best router and corresponding worker based on results
+if overlap_1 > overlap_2:
+    chosen_router, chosen_worker = router_1, worker_id_1
+else:
+    chosen_router, chosen_worker = router_2, worker_id_2
+stream = await chosen_router.generate(tokens, model="model-name", worker_id=chosen_worker)
 ```
 - **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
 - **Advantage**: Query multiple routers before committing to one
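
Reviewer note: the fix in this hunk keeps each router paired with the worker it proposed. The same pairing generalizes to any number of routers, as in the sketch below. `StubRouter` is a hypothetical stand-in that only mirrors the three-value return of `best_worker(tokens)` shown in the hunk; it is not the real `KvRouter`, and the ids and overlap values are made up.

```python
import asyncio


class StubRouter:
    """Hypothetical stand-in for a KV router; returns a canned answer."""

    def __init__(self, name, worker_id, dp_rank, overlap):
        self.name = name
        self._answer = (worker_id, dp_rank, overlap)

    async def best_worker(self, tokens):
        # Mirrors the (worker_id, dp_rank, overlap) shape used in the docs.
        return self._answer


async def choose_router(routers, tokens):
    # Query every router concurrently without committing to any of them.
    answers = await asyncio.gather(*(r.best_worker(tokens) for r in routers))
    # Keep each router paired with its own answer, then take the highest overlap.
    best_router, (worker_id, dp_rank, overlap) = max(
        zip(routers, answers), key=lambda pair: pair[1][2]
    )
    return best_router, worker_id, dp_rank, overlap


routers = [StubRouter("router_1", 7, 0, 3), StubRouter("router_2", 11, 0, 5)]
chosen_router, chosen_worker, chosen_dp_rank, chosen_overlap = asyncio.run(
    choose_router(routers, tokens=[1, 2, 3, 4])
)
```

Because the router and its answer travel together through `zip`, the worker id passed on to `generate` cannot come from a different router. On a tie, `max` keeps the first router in the list, whereas the two-router code in the hunk falls through to `router_2`.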
@@ -218,11 +223,13 @@ Here's an example of using `get_potential_loads()` to implement custom routing t

 ```python
 import asyncio
-from dynamo.llm import DistributedRuntime, KvRouter, KvRouterConfig
+from dynamo.runtime import DistributedRuntime
+from dynamo.llm import KvRouter, KvRouterConfig

 async def minimize_ttft_routing():
    # Setup router
-    runtime = DistributedRuntime.detached()
+    loop = asyncio.get_running_loop()
+    runtime = DistributedRuntime(loop, "etcd", "nats")
    endpoint = runtime.endpoint("dynamo.backend.generate")

    router = KvRouter(

--- a/docs/pages/components/router/router-guide.md
+++ b/docs/pages/components/router/router-guide.md
@@ -35,7 +35,7 @@ Backend workers register themselves using the `register_model` API, after which
 | `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
 | `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
 | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
-| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT). |
+| `--router-kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
 | `--router-queue-threshold <float>` | None (disabled) | Queue threshold fraction; enables priority scheduling via `latency_sensitivity` |

 For all available options: `python -m dynamo.frontend --help`
@@ -142,7 +142,7 @@ To evaluate the benefits of KV-aware routing, compare your workload's performanc

 The main KV-aware routing arguments (frontend uses the same `--router-*` flag names as the standalone router; legacy names without the prefix are obsolete):

-- `--router-kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1. .
+- `--router-kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.

 - `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.

@@ -156,7 +156,7 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na

 - `--router-snapshot-threshold`: Only applies in JetStream mode (`--router-durable-kv-events`). Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.

-- `--no-router-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management. .
+- `--no-router-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.

 - `--router-track-output-blocks`: Enables tracking of output blocks during generation (default: disabled). When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward the expected output sequence length (`agent_hints.osl` in nvext). This improves load balancing accuracy for long-running generation requests by accounting for output-side KV cache growth.

@@ -389,7 +389,7 @@ Persistence behavior depends on which event transport mode is active:
 - Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered
 - Simpler infrastructure (no JetStream required)

-**JetStream Mode** (`--router-durable-kv-events` on **both** frontend **and** workers):**
+**JetStream Mode** (`--router-durable-kv-events` on **both** frontend **and** workers):
 - Prefix blocks are stored in NATS JetStream with 1-hour retention
 - Snapshots saved to NATS object store at configurable thresholds
 - New replicas automatically restore this state on startup