feat: migrate router configuration (#6346)

Commit 858f33fc (unverified), authored Feb 18, 2026 by jh-nv, committed by GitHub Feb 18, 2026
7 changed files
--- a/components/src/dynamo/common/configuration/utils.py
+++ b/components/src/dynamo/common/configuration/utils.py
@@ -116,6 +116,7 @@ def add_negatable_bool_argument(
    default: bool,
    help: str,
    dest: Optional[str] = None,
+    obsolete_flag: Optional[str] = None,
 ) -> None:
    """
    Add negatable boolean flag (--foo / --no-foo).
@@ -126,6 +127,8 @@ def add_negatable_bool_argument(
        env_var: Environment variable name (e.g., "DYN_ENABLE_FEATURE")
        default: Default value
        help: Help text
+        dest: Optional destination name for the parsed value
+        obsolete_flag: Optional obsolete/legacy flag (for help msg only, must start with '--')
    """
    add_argument(
        parser,
@@ -134,6 +137,7 @@ def add_negatable_bool_argument(
        default=default,
        help=help,
        dest=dest,
+        obsolete_flag=obsolete_flag,
        arg_type=None,
        action=argparse.BooleanOptionalAction,
    )
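
The hunk above only threads `obsolete_flag` through to `add_argument`, whose body is not part of this diff. As a rough stdlib-only illustration of the pattern, the sketch below pairs `argparse.BooleanOptionalAction` with an environment-variable default and mentions the legacy flag in the help text. The helper names `env_bool` and `add_negatable_bool`, the accepted truthy strings, and the help wording are assumptions made for illustration, not Dynamo's implementation.

```python
import argparse
import os
from typing import Optional


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean from the environment (hypothetical parsing rules)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def add_negatable_bool(
    parser: argparse.ArgumentParser,
    flag_name: str,
    env_var: str,
    default: bool,
    help: str,
    dest: Optional[str] = None,
    obsolete_flag: Optional[str] = None,
) -> None:
    """Register --foo / --no-foo; the env var only supplies the default."""
    if obsolete_flag is not None:
        if not obsolete_flag.startswith("--"):
            raise ValueError("obsolete_flag must start with '--'")
        # Help text only: the legacy flag is not registered by this sketch.
        help = f"{help} [replaces {obsolete_flag}]"
    extra = {"dest": dest} if dest is not None else {}
    parser.add_argument(
        flag_name,
        action=argparse.BooleanOptionalAction,
        default=env_bool(env_var, default),
        help=help,
        **extra,
    )


parser = argparse.ArgumentParser()
add_negatable_bool(
    parser,
    flag_name="--router-kv-events",
    env_var="DYN_ROUTER_USE_KV_EVENTS",
    default=True,
    help="Enable KV events from workers",
    dest="router_use_kv_events",
    obsolete_flag="--kv-events",
)
print(parser.parse_args(["--no-router-kv-events"]).router_use_kv_events)
```

In this sketch an explicit `--router-kv-events` or `--no-router-kv-events` on the command line takes precedence over the environment variable, which in turn takes precedence over the hard-coded default.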

--- a/components/src/dynamo/router/README.md
+++ b/components/src/dynamo/router/README.md
@@ -18,9 +18,9 @@ This component is **fully configurable** and works with any Dynamo backend (vLLM
 ```bash
 python -m dynamo.router \
    --endpoint dynamo.prefill.generate \
-    --block-size 64 \
+    --router-block-size 64 \
    --router-reset-states \
-    --no-track-active-blocks
+    --no-router-track-active-blocks
 ```
 ### Arguments
@@ -29,16 +29,16 @@ python -m dynamo.router \
 - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
 **Router Configuration:**
-For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/pages/components/router/router-guide.md).
+All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see the [Router Guide](/docs/pages/components/router/router-guide.md).
 ## Architecture
 The standalone router exposes two endpoints via the Dynamo runtime:
-1. **`find_best_worker`**: Given a request with token IDs, returns the best worker to handle it
+1. **`generate`**: Routes requests to the best worker and streams back generation results (KV-aware routing).
-2. **`free`**: Cleans up router state when a request completes
+2. **`best_worker_id`**: Given token IDs, returns the best worker ID for the request without routing; useful for debugging or custom routing logic.
-Clients query the `find_best_worker` endpoint to determine which worker should process each request, then call the selected worker directly.
+Clients call the `generate` endpoint to stream completions, or call `best_worker_id` to decide which worker to use and then contact that worker directly.
 ## Example: Manual Disaggregated Serving (Alternative Setup)
@@ -59,9 +59,9 @@ python -m dynamo.frontend \
 # Start standalone router for prefill workers
 python -m dynamo.router \
    --endpoint dynamo.prefill.generate \
-    --block-size 64 \
+    --router-block-size 64 \
    --router-reset-states \
-    --no-track-active-blocks
+    --no-router-track-active-blocks
 # Start decode workers
 python -m dynamo.vllm --model MODEL_NAME --block-size 64 &
@@ -71,10 +71,10 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker &
 ```
 >[!Note]
-> **Why `--no-track-active-blocks` for prefill routing?**
+> **Why `--no-router-track-active-blocks` for prefill routing?**
 > Active block tracking is used for load balancing across decode (generation) phases. For prefill-only routing, decode load is not relevant, so disabling this reduces overhead and simplifies the router state.
 >
-> **Why `--block-size` is required for standalone routers:**
+> **Why `--router-block-size` is required for standalone routers:**
 > Unlike the frontend router which can infer block size from the ModelDeploymentCard (MDC) during worker registration, standalone routers cannot access the MDC and must have the block size explicitly specified. This is a work in progress to enable automatic inference.
 ## Configuration Best Practices
@@ -82,8 +82,8 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker &
 >[!Note]
 > **Block Size Matching:**
 > The block size must match across:
-> - Standalone router (`--block-size`)
+> - Standalone router (`--router-block-size`)
-> - All worker instances (`--block-size`)
+> - All worker instances (backend-specific, e.g. `--block-size` for vLLM)
 >
 > **Endpoint Matching:**
 > The `--endpoint` argument must match where your target workers register. For example:
@@ -95,9 +95,9 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker &
 To integrate the standalone router with a backend:
-1. Clients should query the `router.find_best_worker` endpoint before sending requests
+1. Workers should register at the endpoint specified by the `--endpoint` argument
-2. Workers should register at the endpoint specified by the `--endpoint` argument
+2. Clients call the `router.generate` endpoint to stream completions (router selects the best worker), or call `router.best_worker_id` to get the best worker ID and then send requests to that worker
-3. Clients should call the `router.free` endpoint when requests complete
+3. Router state is updated automatically as requests are routed; no separate "free" call is required
 See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a reference implementation (search for `prefill_router_client`).
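
For launch scripts that still pass the legacy names, the renames in this README change can be applied mechanically. The sketch below is one possible way to do that; the table is taken from the `obsolete_flag` values added in `backend_args.py`, while the helper names (`build_rename_table`, `migrate_router_argv`) and the handling of `--no-*` and `--flag=value` forms are assumptions. It should only be applied to the standalone router's arguments, since workers keep their backend-specific flags (e.g. `--block-size` for vLLM).

```python
from typing import Dict, List

# Legacy -> current names, taken from the obsolete_flag values in this commit.
VALUE_FLAGS: Dict[str, str] = {
    "--block-size": "--router-block-size",
    "--kv-overlap-score-weight": "--router-kv-overlap-score-weight",
}
BOOL_FLAGS: Dict[str, str] = {
    "--kv-events": "--router-kv-events",
    "--durable-kv-events": "--router-durable-kv-events",
    "--track-active-blocks": "--router-track-active-blocks",
    "--assume-kv-reuse": "--router-assume-kv-reuse",
    "--track-output-blocks": "--router-track-output-blocks",
}


def build_rename_table() -> Dict[str, str]:
    """Combine value flags with boolean flags and their --no-* forms."""
    table = dict(VALUE_FLAGS)
    for old, new in BOOL_FLAGS.items():
        table[old] = new
        table["--no-" + old[2:]] = "--no-" + new[2:]
    return table


def migrate_router_argv(argv: List[str]) -> List[str]:
    """Rewrite legacy router flags; unknown tokens pass through unchanged."""
    table = build_rename_table()
    migrated = []
    for token in argv:
        # Support both "--flag value" and "--flag=value" spellings.
        name, sep, value = token.partition("=")
        migrated.append(table.get(name, name) + sep + value)
    return migrated


old_argv = [
    "--endpoint", "dynamo.prefill.generate",
    "--block-size", "64",
    "--router-reset-states",
    "--no-track-active-blocks",
]
print(" ".join(migrate_router_argv(old_argv)))
```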

--- a/components/src/dynamo/router/__main__.py
+++ b/components/src/dynamo/router/__main__.py
@@ -12,15 +12,16 @@ to prefill workers) or any other scenario requiring intelligent KV cache-aware
 routing decisions.
 """
-import argparse
 import asyncio
 import logging
-import os
 from typing import Optional
 import uvloop
 from dynamo.llm import KvRouter, KvRouterConfig
+from dynamo.router.args import build_kv_router_config
+from dynamo.router.args import parse_args as parse_router_args
+from dynamo.router.backend_args import DynamoRouterConfig
 from dynamo.runtime import Client, DistributedRuntime, dynamo_worker
 from dynamo.runtime.logging import configure_dynamo_logging
@@ -151,192 +152,42 @@ class StandaloneRouterHandler:
        yield worker_id
-def parse_args():
+def parse_args(argv=None) -> DynamoRouterConfig:
-    parser = argparse.ArgumentParser(
+    """Parse router CLI arguments (compatibility shim delegating to args.parse_args)."""
-        description="Dynamo Standalone Router Service: Configurable KV-aware routing for any worker endpoint",
+    return parse_router_args(argv)
-        formatter_class=argparse.RawTextHelpFormatter,
-    )
-    parser.add_argument(
-        "--endpoint",
-        type=str,
-        required=True,
-        help=(
-            "Full endpoint path for workers in the format namespace.component.endpoint\n"
-            "(e.g., dynamo.prefill.generate for prefill workers)"
-        ),
-    )
-    parser.add_argument(
-        "--block-size",
-        type=int,
-        default=128,
-        help="KV cache block size for routing decisions (default: 128)",
-    )
-    parser.add_argument(
-        "--kv-overlap-score-weight",
-        type=float,
-        default=1.0,
-        help="KV Router: Weight for overlap score in worker selection. Higher values prioritize KV cache reuse (default: 1.0)",
-    )
-    parser.add_argument(
-        "--router-temperature",
-        type=float,
-        default=0.0,
-        help="KV Router: Temperature for worker sampling via softmax. Higher values promote more randomness, and 0 fallbacks to deterministic (default: 0.0)",
-    )
-    parser.add_argument(
-        "--no-kv-events",
-        action="store_false",
-        dest="use_kv_events",
-        default=True,
-        help="KV Router: Disable KV events. When set, the router predicts cache state based on routing decisions with TTL-based expiration and pruning, rather than receiving events from workers. By default, KV events are enabled.",
-    )
-    parser.add_argument(
-        "--router-replica-sync",
-        action="store_true",
-        default=False,
-        help="KV Router: Enable replica synchronization across multiple router instances. When true, routers will publish and subscribe to events to maintain consistent state (default: False)",
-    )
-    parser.add_argument(
-        "--router-snapshot-threshold",
-        type=int,
-        default=1000000,
-        help="KV Router: Number of messages in stream before triggering a snapshot (default: 1000000)",
-    )
-    parser.add_argument(
-        "--router-reset-states",
-        action="store_true",
-        dest="router_reset_states",
-        default=False,
-        help="KV Router: Reset router state on startup, purging stream and object store. By default, states are persisted. WARNING: This can affect existing router replicas (default: False)",
-    )
-    parser.add_argument(
-        "--durable-kv-events",
-        action="store_true",
-        dest="durable_kv_events",
-        default=False,
-        help="KV Router: Enable durable KV events using NATS JetStream instead of NATS Core. By default, the router uses the generic event plane (NATS Core or ZMQ) with local_indexer mode. Use this flag when you need durability and multi-replica consistency. Requires NATS with JetStream enabled.",
-    )
-    parser.add_argument(
-        "--no-track-active-blocks",
-        action="store_false",
-        dest="router_track_active_blocks",
-        default=True,
-        help="KV Router: Disable tracking of active blocks (blocks being used for ongoing generation). By default, active blocks are tracked for load balancing (default: True)",
-    )
-    parser.add_argument(
-        "--no-assume-kv-reuse",
-        action="store_false",
-        dest="router_assume_kv_reuse",
-        default=True,
-        help="KV Router: When tracking active blocks, do not assume KV cache reuse (generate random hashes instead of computing actual block hashes). Useful when KV cache reuse is not expected. By default, KV cache reuse is assumed.",
-    )
-    parser.add_argument(
-        "--track-output-blocks",
-        action="store_true",
-        dest="router_track_output_blocks",
-        default=False,
-        help="KV Router: Track output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward expected output sequence length (agent_hints.osl in nvext). Default: False.",
-    )
-    parser.add_argument(
-        "--router-ttl-secs",
-        type=float,
-        default=120.0,
-        help="KV Router: TTL for blocks in seconds. Only used when --no-kv-events is set. Controls how long cached blocks are considered valid without explicit events (default: 120.0)",
-    )
-    parser.add_argument(
-        "--router-max-tree-size",
-        type=int,
-        default=2**20,
-        help="KV Router: Maximum tree size before pruning. Only used when --no-kv-events is set. When the indexer tree exceeds this size, pruning is triggered (default: 1048576, which is 2^20)",
-    )
-    parser.add_argument(
-        "--router-prune-target-ratio",
-        type=float,
-        default=0.8,
-        help="KV Router: Target size ratio after pruning (0.0-1.0). Only used when --no-kv-events is set. Determines how aggressively to prune the tree (default: 0.8)",
-    )
-    parser.add_argument(
-        "--router-event-threads",
-        type=int,
-        default=int(os.environ.get("DYN_ROUTER_EVENT_THREADS", "1")),
-        help="KV Router: Number of event processing threads. When > 1, uses a concurrent radix tree with a thread pool for higher throughput. Can be set via DYN_ROUTER_EVENT_THREADS env var (default: 1).",
-    )
-    return parser.parse_args()
 @dynamo_worker()
 async def worker(runtime: DistributedRuntime):
    """Main worker function for the standalone router service."""
-    args = parse_args()
+    config = parse_args()
-    # Parse endpoint path to get namespace for service registration
-    endpoint_parts = args.endpoint.split(".")
-    if len(endpoint_parts) != 3:
-        raise ValueError(
-            f"Invalid endpoint path format: {args.endpoint}. "
-            "Expected format: namespace.component.endpoint"
-        )
-    namespace = endpoint_parts[0]
    logger.info("Starting Standalone Router Service")
    logger.debug(
-        f"Configuration: endpoint={args.endpoint}, block_size={args.block_size}, "
+        f"Configuration: endpoint={config.endpoint}, router_block_size={config.router_block_size}, "
-        f"overlap_score_weight={args.kv_overlap_score_weight}, "
+        f"overlap_score_weight={config.router_kv_overlap_score_weight}, "
-        f"router_temperature={args.router_temperature}, "
+        f"router_temperature={config.router_temperature}, "
-        f"use_kv_events={args.use_kv_events}, "
+        f"router_use_kv_events={config.router_use_kv_events}, "
-        f"durable_kv_events={args.durable_kv_events}, "
+        f"router_durable_kv_events={config.router_durable_kv_events}, "
-        f"router_replica_sync={args.router_replica_sync}, "
+        f"router_replica_sync={config.router_replica_sync}, "
-        f"router_reset_states={args.router_reset_states}, "
+        f"router_reset_states={config.router_reset_states}, "
-        f"router_track_active_blocks={args.router_track_active_blocks}, "
+        f"router_track_active_blocks={config.router_track_active_blocks}, "
-        f"router_track_output_blocks={args.router_track_output_blocks}, "
+        f"router_track_output_blocks={config.router_track_output_blocks}, "
-        f"router_assume_kv_reuse={args.router_assume_kv_reuse}, "
+        f"router_assume_kv_reuse={config.router_assume_kv_reuse}, "
-        f"router_ttl_secs={args.router_ttl_secs}, "
+        f"router_ttl_secs={config.router_ttl_secs}, "
-        f"router_max_tree_size={args.router_max_tree_size}, "
+        f"router_max_tree_size={config.router_max_tree_size}, "
-        f"router_prune_target_ratio={args.router_prune_target_ratio}"
+        f"router_prune_target_ratio={config.router_prune_target_ratio}"
    )
-    # Create KvRouter configuration
+    kv_router_config = build_kv_router_config(config)
-    kv_router_config = KvRouterConfig(
-        overlap_score_weight=args.kv_overlap_score_weight,
-        router_temperature=args.router_temperature,
-        use_kv_events=args.use_kv_events,
-        durable_kv_events=args.durable_kv_events,
-        router_replica_sync=args.router_replica_sync,
-        router_track_active_blocks=args.router_track_active_blocks,
-        router_track_output_blocks=args.router_track_output_blocks,
-        router_assume_kv_reuse=args.router_assume_kv_reuse,
-        router_snapshot_threshold=args.router_snapshot_threshold,
-        router_reset_states=args.router_reset_states,
-        router_ttl_secs=args.router_ttl_secs,
-        router_max_tree_size=args.router_max_tree_size,
-        router_prune_target_ratio=args.router_prune_target_ratio,
-        router_event_threads=args.router_event_threads,
-    )
    # Create service component - use "router" as component name
-    component = runtime.namespace(namespace).component("router")
+    component = runtime.namespace(config.namespace).component("router")
    # Create handler
    handler = StandaloneRouterHandler(
-        runtime, args.endpoint, args.block_size, kv_router_config
+        runtime, config.endpoint, config.router_block_size, kv_router_config
    )
    await handler.initialize()
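
The hand-written debug f-string in `worker` lists most options but leaves out `router_snapshot_threshold` and `router_event_threads`. One way to keep such a log line in step with the config is to derive it from the object's attributes. The sketch below assumes the parsed config stores its values as plain instance attributes, which `ConfigBase` may or may not do since its definition is not shown here. `format_config` is a hypothetical helper, and `SimpleNamespace` stands in for `DynamoRouterConfig`.

```python
from types import SimpleNamespace


def format_config(config: object) -> str:
    """Render every instance attribute as name=value, sorted by name."""
    fields = sorted(vars(config).items())
    return ", ".join(f"{name}={value!r}" for name, value in fields)


# Stand-in for a parsed DynamoRouterConfig (values mirror the diff's defaults).
config = SimpleNamespace(
    endpoint="dynamo.prefill.generate",
    router_block_size=64,
    router_snapshot_threshold=1000000,
    router_event_threads=1,
)
print("Configuration: " + format_config(config))
```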

--- a/components/src/dynamo/router/args.py
+++ b/components/src/dynamo/router/args.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Router CLI parsing and config assembly."""
+import argparse
+from dynamo.llm import KvRouterConfig
+from .backend_args import DynamoRouterArgGroup, DynamoRouterConfig
+def build_kv_router_config(router_config: DynamoRouterConfig) -> KvRouterConfig:
+    """Build KvRouterConfig from DynamoRouterConfig.
+    Maps CLI/config attribute names to KvRouterConfig constructor kwargs.
+    Three names differ: router_kv_overlap_score_weight -> overlap_score_weight,
+    router_use_kv_events -> use_kv_events, and
+    router_durable_kv_events -> durable_kv_events.
+    """
+    return KvRouterConfig(
+        overlap_score_weight=router_config.router_kv_overlap_score_weight,
+        router_temperature=router_config.router_temperature,
+        use_kv_events=router_config.router_use_kv_events,
+        durable_kv_events=router_config.router_durable_kv_events,
+        router_replica_sync=router_config.router_replica_sync,
+        router_track_active_blocks=router_config.router_track_active_blocks,
+        router_track_output_blocks=router_config.router_track_output_blocks,
+        router_assume_kv_reuse=router_config.router_assume_kv_reuse,
+        router_snapshot_threshold=router_config.router_snapshot_threshold,
+        router_reset_states=router_config.router_reset_states,
+        router_ttl_secs=router_config.router_ttl_secs,
+        router_max_tree_size=router_config.router_max_tree_size,
+        router_prune_target_ratio=router_config.router_prune_target_ratio,
+        router_event_threads=router_config.router_event_threads,
+    )
+def parse_args(argv=None) -> DynamoRouterConfig:
+    """Parse command-line arguments for the standalone router.
+    Returns:
+        DynamoRouterConfig: Parsed and validated configuration.
+    """
+    parser = argparse.ArgumentParser(
+        description="Dynamo Standalone Router Service: Configurable KV-aware routing for any worker endpoint",
+        formatter_class=argparse.RawTextHelpFormatter,
+    )
+    group = DynamoRouterArgGroup()
+    group.add_arguments(parser)
+    args = parser.parse_args(argv)
+    config = DynamoRouterConfig.from_cli_args(args)
+    config.validate()
+    return config
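
The mapping performed by `build_kv_router_config` can also be expressed as data. The sketch below works on plain dictionaries so that it runs without `dynamo.llm`; `to_kv_router_kwargs`, `RENAMED`, and `NOT_FORWARDED` are hypothetical names, and the rename and skip sets are read off the constructor call in `args.py`. It illustrates the mapping and is not a drop-in replacement.

```python
from typing import Any, Dict

# Config attributes whose KvRouterConfig keyword differs (per args.py).
RENAMED: Dict[str, str] = {
    "router_kv_overlap_score_weight": "overlap_score_weight",
    "router_use_kv_events": "use_kv_events",
    "router_durable_kv_events": "durable_kv_events",
}
# Config attributes that are not passed to KvRouterConfig at all.
NOT_FORWARDED = {"namespace", "endpoint", "router_block_size"}


def to_kv_router_kwargs(config_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Translate config attribute names into KvRouterConfig keyword names."""
    return {
        RENAMED.get(name, name): value
        for name, value in config_fields.items()
        if name not in NOT_FORWARDED
    }


example = {
    "endpoint": "dynamo.prefill.generate",
    "router_block_size": 64,
    "router_kv_overlap_score_weight": 1.0,
    "router_use_kv_events": False,
    "router_ttl_secs": 120.0,
}
print(to_kv_router_kwargs(example))
```

Keeping the renames in one table makes it easier to spot when a new `router_*` option is added to the config but not forwarded.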
--- a/components/src/dynamo/router/backend_args.py
+++ b/components/src/dynamo/router/backend_args.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Dynamo standalone router configuration ArgGroup."""
+from dynamo.common.configuration.arg_group import ArgGroup
+from dynamo.common.configuration.config_base import ConfigBase
+from dynamo.common.configuration.utils import add_argument, add_negatable_bool_argument
+class DynamoRouterConfig(ConfigBase):
+    """Typed configuration for the standalone KV router (router-owned options only)."""
+    namespace: str
+    endpoint: str
+    router_block_size: int
+    router_kv_overlap_score_weight: float
+    router_temperature: float
+    router_use_kv_events: bool
+    router_replica_sync: bool
+    router_snapshot_threshold: int
+    router_reset_states: bool
+    router_durable_kv_events: bool
+    router_track_active_blocks: bool
+    router_assume_kv_reuse: bool
+    router_track_output_blocks: bool
+    router_ttl_secs: float
+    router_max_tree_size: int
+    router_prune_target_ratio: float
+    router_event_threads: int
+    def validate(self) -> None:
+        """Validate config invariants (aligned with Rust KvRouterConfig where applicable)."""
+        if not self.endpoint:
+            raise ValueError(
+                "endpoint is required (set --endpoint or DYN_ROUTER_ENDPOINT)"
+            )
+        parts = self.endpoint.split(".")
+        if len(parts) != 3:
+            raise ValueError(
+                f"Invalid endpoint format: {self.endpoint!r}. "
+                "Expected format: namespace.component.endpoint"
+            )
+        self.namespace = parts[0]
+class DynamoRouterArgGroup(ArgGroup):
+    """CLI argument group for standalone router options."""
+    name = "dynamo-router"
+    def add_arguments(self, parser) -> None:
+        """Add router-owned arguments to parser."""
+        g = parser.add_argument_group("Dynamo Router Options")
+        add_argument(
+            g,
+            flag_name="--endpoint",
+            env_var="DYN_ROUTER_ENDPOINT",
+            default=None,
+            help="Full endpoint path for workers in the format namespace.component.endpoint (e.g., dynamo.prefill.generate for prefill workers)",
+            arg_type=str,
+        )
+        add_argument(
+            g,
+            flag_name="--router-block-size",
+            env_var="DYN_ROUTER_BLOCK_SIZE",
+            default=128,
+            help="KV cache block size for routing decisions",
+            arg_type=int,
+            obsolete_flag="--block-size",
+        )
+        add_argument(
+            g,
+            flag_name="--router-kv-overlap-score-weight",
+            env_var="DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT",
+            default=1.0,
+            help="KV Router: Weight for overlap score in worker selection. Higher values prioritize KV cache reuse",
+            arg_type=float,
+            obsolete_flag="--kv-overlap-score-weight",
+        )
+        add_argument(
+            g,
+            flag_name="--router-temperature",
+            env_var="DYN_ROUTER_TEMPERATURE",
+            default=0.0,
+            help="KV Router: Temperature for worker sampling via softmax. Higher values promote more randomness, and 0 falls back to deterministic selection.",
+            arg_type=float,
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-kv-events",
+            env_var="DYN_ROUTER_USE_KV_EVENTS",
+            default=True,
+            help="KV Router: Enable KV events from workers. When disabled (--no-router-kv-events), the router predicts cache state based on routing decisions with TTL-based expiration and pruning, rather than receiving events from workers.",
+            dest="router_use_kv_events",
+            obsolete_flag="--kv-events",
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-replica-sync",
+            env_var="DYN_ROUTER_REPLICA_SYNC",
+            default=False,
+            help="KV Router: Enable replica synchronization across multiple router instances. When true, routers will publish and subscribe to events to maintain consistent state.",
+        )
+        add_argument(
+            g,
+            flag_name="--router-snapshot-threshold",
+            env_var="DYN_ROUTER_SNAPSHOT_THRESHOLD",
+            default=1000000,
+            help="KV Router: Number of messages in stream before triggering a snapshot",
+            arg_type=int,
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-reset-states",
+            env_var="DYN_ROUTER_RESET_STATES",
+            default=False,
+            help="KV Router: Reset router state on startup, purging stream and object store. WARNING: Can affect existing router replicas.",
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-durable-kv-events",
+            env_var="DYN_ROUTER_DURABLE_KV_EVENTS",
+            default=False,
+            help="KV Router: Enable durable KV events using NATS JetStream instead of NATS Core. By default, the router uses the generic event plane (NATS Core or ZMQ) with local_indexer mode. Use this flag when you need durability and multi-replica consistency. Requires NATS with JetStream enabled.",
+            obsolete_flag="--durable-kv-events",
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-track-active-blocks",
+            env_var="DYN_ROUTER_TRACK_ACTIVE_BLOCKS",
+            default=True,
+            help="KV Router: Track active blocks for load balancing. Use --no-router-track-active-blocks to disable",
+            obsolete_flag="--track-active-blocks",
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-assume-kv-reuse",
+            env_var="DYN_ROUTER_ASSUME_KV_REUSE",
+            default=True,
+            help="KV Router: When tracking active blocks, assume KV cache reuse. Use --no-router-assume-kv-reuse to use random hashes, useful when KV cache reuse is not expected.",
+            obsolete_flag="--assume-kv-reuse",
+        )
+        add_negatable_bool_argument(
+            g,
+            flag_name="--router-track-output-blocks",
+            env_var="DYN_ROUTER_TRACK_OUTPUT_BLOCKS",
+            default=False,
+            help="KV Router: Track output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward expected output sequence length (agent_hints.osl in nvext).",
+            obsolete_flag="--track-output-blocks",
+        )
+        add_argument(
+            g,
+            flag_name="--router-ttl-secs",
+            env_var="DYN_ROUTER_TTL_SECS",
+            default=120.0,
+            help="KV Router: TTL for blocks in seconds. Only used when --no-router-kv-events is set. Controls how long cached blocks are considered valid without explicit events.",
+            arg_type=float,
+        )
+        add_argument(
+            g,
+            flag_name="--router-max-tree-size",
+            env_var="DYN_ROUTER_MAX_TREE_SIZE",
+            default=2**20,
+            help="KV Router: Maximum tree size before pruning. Only used when --no-router-kv-events is set. When the indexer tree exceeds this size, pruning is triggered.",
+            arg_type=int,
+        )
+        add_argument(
+            g,
+            flag_name="--router-prune-target-ratio",
+            env_var="DYN_ROUTER_PRUNE_TARGET_RATIO",
+            default=0.8,
+            help="KV Router: Target size ratio after pruning (0.0-1.0). Only used when --no-router-kv-events is set. Determines how aggressively to prune the tree.",
+            arg_type=float,
+        )
+        add_argument(
+            g,
+            flag_name="--router-event-threads",
+            env_var="DYN_ROUTER_EVENT_THREADS",
+            default=1,
+            help="KV Router: Number of event processing threads. >1 uses concurrent radix tree and thread pool for higher throughput.",
+            arg_type=int,
+        )
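
The endpoint check in `DynamoRouterConfig.validate` can be exercised in isolation. The sketch below reproduces the three-part split and, as an assumption beyond what the diff does, also rejects empty segments such as `a..b`. `EndpointPath` and `split_endpoint` are hypothetical names.

```python
from typing import NamedTuple


class EndpointPath(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def split_endpoint(path: str) -> EndpointPath:
    """Split 'namespace.component.endpoint' into its three parts."""
    if not path:
        raise ValueError(
            "endpoint is required (set --endpoint or DYN_ROUTER_ENDPOINT)"
        )
    parts = path.split(".")
    if len(parts) != 3:
        raise ValueError(
            f"Invalid endpoint format: {path!r}. "
            "Expected format: namespace.component.endpoint"
        )
    # Stricter than the diff: also reject empty segments such as "a..b".
    if not all(parts):
        raise ValueError(f"Endpoint {path!r} contains an empty segment")
    return EndpointPath(*parts)


parsed = split_endpoint("dynamo.prefill.generate")
print(parsed.namespace, parsed.component, parsed.endpoint)
```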
--- a/docs/pages/components/router/agent-hints.md
+++ b/docs/pages/components/router/agent-hints.md
@@ -49,11 +49,11 @@ A request with `latency_sensitivity: 5.0` arriving at time `T` is treated as if
 Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:
-1. **Output block tracking**: When `--track-output-blocks` is enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress toward `osl`. This gives the router a more accurate picture of each worker's KV cache utilization for long-running requests.
+1. **Output block tracking**: When output block tracking is enabled (frontend: `--track-output-blocks`; standalone router: `--router-track-output-blocks`), the router adds placeholder blocks during generation and applies fractional decay based on progress toward `osl`. This gives the router a more accurate picture of each worker's KV cache utilization for long-running requests.
 2. **Resource estimation**: Helps the router estimate total resource requirements when making routing decisions.
 - **Type**: `u32` (optional)
-- **Requires**: `--track-output-blocks` for output block tracking behavior
+- **Requires**: `--track-output-blocks` (frontend) or `--router-track-output-blocks` (standalone router) for output block tracking behavior
 ### Example

--- a/docs/pages/components/router/router-guide.md
+++ b/docs/pages/components/router/router-guide.md
@@ -310,7 +310,7 @@ await prefill_endpoint.serve_endpoint(prefill_handler.generate)
 ```
 > [!Note]
-> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
+> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
 ### Request Flow