@@ -18,9 +18,9 @@ This component is **fully configurable** and works with any Dynamo backend (vLLM
...
@@ -18,9 +18,9 @@ This component is **fully configurable** and works with any Dynamo backend (vLLM
```bash
```bash
python -m dynamo.router \
python -m dynamo.router \
--endpoint dynamo.prefill.generate \
--endpoint dynamo.prefill.generate \
--block-size 64 \
--router-block-size 64 \
--router-reset-states\
--router-reset-states\
--no-track-active-blocks
--no-router-track-active-blocks
```
```
### Arguments
### Arguments
...
@@ -29,16 +29,16 @@ python -m dynamo.router \
...
@@ -29,16 +29,16 @@ python -m dynamo.router \
-`--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
-`--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
**Router Configuration:**
**Router Configuration:**
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/pages/components/router/router-guide.md).
All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see the [Router Guide](/docs/pages/components/router/router-guide.md).
## Architecture
## Architecture
The standalone router exposes two endpoints via the Dynamo runtime:
The standalone router exposes two endpoints via the Dynamo runtime:
1.**`find_best_worker`**: Given a request with token IDs, returns the best worker to handle it
1.**`generate`**: Routes requests to the best worker and streams back generation results (KV-aware routing).
2.**`free`**: Cleans up router state when a request completes
2.**`best_worker_id`**: Given token IDs, returns the best worker ID for the request without routing; useful for debugging or custom routing logic.
Clients query the `find_best_worker` endpoint to determine which worker should process each request, then call the selected worker directly.
Clients call the `generate` endpoint to stream completions, or call `best_worker_id` to decide which worker to use and then contact that worker directly.
> **Why `--no-track-active-blocks` for prefill routing?**
> **Why `--no-router-track-active-blocks` for prefill routing?**
> Active block tracking is used for load balancing across decode (generation) phases. For prefill-only routing, decode load is not relevant, so disabling this reduces overhead and simplifies the router state.
> Active block tracking is used for load balancing across decode (generation) phases. For prefill-only routing, decode load is not relevant, so disabling this reduces overhead and simplifies the router state.
>
>
> **Why `--block-size` is required for standalone routers:**
> **Why `--router-block-size` is required for standalone routers:**
> Unlike the frontend router which can infer block size from the ModelDeploymentCard (MDC) during worker registration, standalone routers cannot access the MDC and must have the block size explicitly specified. This is a work in progress to enable automatic inference.
> Unlike the frontend router which can infer block size from the ModelDeploymentCard (MDC) during worker registration, standalone routers cannot access the MDC and must have the block size explicitly specified. This is a work in progress to enable automatic inference.
To integrate the standalone router with a backend:
To integrate the standalone router with a backend:
1.Clients should query the `router.find_best_worker` endpoint before sending requests
1.Workers should register at the endpoint specified by the `--endpoint` argument
2.Workers should registerat the endpoint specified by the `--endpoint` argument
2.Clients call the `router.generate` endpoint to stream completions (router selects the best worker), or call `router.best_worker_id` to get the best worker ID and then send requests to that worker
3.Clients should call the `router.free` endpoint when requests complete
3.Router state is updated automatically as requests are routed; no separate "free" call is required
See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a reference implementation (search for `prefill_router_client`).
See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a reference implementation (search for `prefill_router_client`).
@@ -151,192 +152,42 @@ class StandaloneRouterHandler:
...
@@ -151,192 +152,42 @@ class StandaloneRouterHandler:
yieldworker_id
yieldworker_id
defparse_args():
defparse_args(argv=None)->DynamoRouterConfig:
parser=argparse.ArgumentParser(
"""Parse router CLI arguments (compatibility shim delegating to args.parse_args)."""
description="Dynamo Standalone Router Service: Configurable KV-aware routing for any worker endpoint",
returnparse_router_args(argv)
formatter_class=argparse.RawTextHelpFormatter,
)
parser.add_argument(
"--endpoint",
type=str,
required=True,
help=(
"Full endpoint path for workers in the format namespace.component.endpoint\n"
"(e.g., dynamo.prefill.generate for prefill workers)"
),
)
parser.add_argument(
"--block-size",
type=int,
default=128,
help="KV cache block size for routing decisions (default: 128)",
)
parser.add_argument(
"--kv-overlap-score-weight",
type=float,
default=1.0,
help="KV Router: Weight for overlap score in worker selection. Higher values prioritize KV cache reuse (default: 1.0)",
)
parser.add_argument(
"--router-temperature",
type=float,
default=0.0,
help="KV Router: Temperature for worker sampling via softmax. Higher values promote more randomness, and 0 fallbacks to deterministic (default: 0.0)",
)
parser.add_argument(
"--no-kv-events",
action="store_false",
dest="use_kv_events",
default=True,
help="KV Router: Disable KV events. When set, the router predicts cache state based on routing decisions with TTL-based expiration and pruning, rather than receiving events from workers. By default, KV events are enabled.",
)
parser.add_argument(
"--router-replica-sync",
action="store_true",
default=False,
help="KV Router: Enable replica synchronization across multiple router instances. When true, routers will publish and subscribe to events to maintain consistent state (default: False)",
)
parser.add_argument(
"--router-snapshot-threshold",
type=int,
default=1000000,
help="KV Router: Number of messages in stream before triggering a snapshot (default: 1000000)",
)
parser.add_argument(
"--router-reset-states",
action="store_true",
dest="router_reset_states",
default=False,
help="KV Router: Reset router state on startup, purging stream and object store. By default, states are persisted. WARNING: This can affect existing router replicas (default: False)",
)
parser.add_argument(
"--durable-kv-events",
action="store_true",
dest="durable_kv_events",
default=False,
help="KV Router: Enable durable KV events using NATS JetStream instead of NATS Core. By default, the router uses the generic event plane (NATS Core or ZMQ) with local_indexer mode. Use this flag when you need durability and multi-replica consistency. Requires NATS with JetStream enabled.",
)
parser.add_argument(
"--no-track-active-blocks",
action="store_false",
dest="router_track_active_blocks",
default=True,
help="KV Router: Disable tracking of active blocks (blocks being used for ongoing generation). By default, active blocks are tracked for load balancing (default: True)",
)
parser.add_argument(
"--no-assume-kv-reuse",
action="store_false",
dest="router_assume_kv_reuse",
default=True,
help="KV Router: When tracking active blocks, do not assume KV cache reuse (generate random hashes instead of computing actual block hashes). Useful when KV cache reuse is not expected. By default, KV cache reuse is assumed.",
)
parser.add_argument(
"--track-output-blocks",
action="store_true",
dest="router_track_output_blocks",
default=False,
help="KV Router: Track output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward expected output sequence length (agent_hints.osl in nvext). Default: False.",
)
parser.add_argument(
"--router-ttl-secs",
type=float,
default=120.0,
help="KV Router: TTL for blocks in seconds. Only used when --no-kv-events is set. Controls how long cached blocks are considered valid without explicit events (default: 120.0)",
)
parser.add_argument(
"--router-max-tree-size",
type=int,
default=2**20,
help="KV Router: Maximum tree size before pruning. Only used when --no-kv-events is set. When the indexer tree exceeds this size, pruning is triggered (default: 1048576, which is 2^20)",
)
parser.add_argument(
"--router-prune-target-ratio",
type=float,
default=0.8,
help="KV Router: Target size ratio after pruning (0.0-1.0). Only used when --no-kv-events is set. Determines how aggressively to prune the tree (default: 0.8)",
help="KV Router: Number of event processing threads. When > 1, uses a concurrent radix tree with a thread pool for higher throughput. Can be set via DYN_ROUTER_EVENT_THREADS env var (default: 1).",
)
returnparser.parse_args()
@dynamo_worker()
@dynamo_worker()
asyncdefworker(runtime:DistributedRuntime):
asyncdefworker(runtime:DistributedRuntime):
"""Main worker function for the standalone router service."""
"""Main worker function for the standalone router service."""
args=parse_args()
config=parse_args()
# Parse endpoint path to get namespace for service registration
help="Full endpoint path for workers in the format namespace.component.endpoint (e.g., dynamo.prefill.generate for prefill workers)",
arg_type=str,
)
add_argument(
g,
flag_name="--router-block-size",
env_var="DYN_ROUTER_BLOCK_SIZE",
default=128,
help="KV cache block size for routing decisions",
arg_type=int,
obsolete_flag="--block-size",
)
add_argument(
g,
flag_name="--router-kv-overlap-score-weight",
env_var="DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT",
default=1.0,
help="KV Router: Weight for overlap score in worker selection. Higher values prioritize KV cache reuse",
arg_type=float,
obsolete_flag="--kv-overlap-score-weight",
)
add_argument(
g,
flag_name="--router-temperature",
env_var="DYN_ROUTER_TEMPERATURE",
default=0.0,
help="KV Router: Temperature for worker sampling via softmax. Higher values promote more randomness, and 0 fallbacks to deterministic.",
arg_type=float,
)
add_negatable_bool_argument(
g,
flag_name="--router-kv-events",
env_var="DYN_ROUTER_USE_KV_EVENTS",
default=True,
help="KV Router: Enable KV events from workers. When disabled (--no-router-kv-events), the router predicts cache state based on routing decisions with TTL-based expiration and pruning, rather than receiving events from workers.",
dest="router_use_kv_events",
obsolete_flag="--kv-events",
)
add_negatable_bool_argument(
g,
flag_name="--router-replica-sync",
env_var="DYN_ROUTER_REPLICA_SYNC",
default=False,
help="KV Router: Enable replica synchronization across multiple router instances. When true, routers will publish and subscribe to events to maintain consistent state.",
)
add_argument(
g,
flag_name="--router-snapshot-threshold",
env_var="DYN_ROUTER_SNAPSHOT_THRESHOLD",
default=1000000,
help="KV Router: Number of messages in stream before triggering a snapshot",
arg_type=int,
)
add_negatable_bool_argument(
g,
flag_name="--router-reset-states",
env_var="DYN_ROUTER_RESET_STATES",
default=False,
help="KV Router: Reset router state on startup, purging stream and object store. WARNING: Can affect existing router replicas.",
)
add_negatable_bool_argument(
g,
flag_name="--router-durable-kv-events",
env_var="DYN_ROUTER_DURABLE_KV_EVENTS",
default=False,
help="KV Router: Enable durable KV events using NATS JetStream instead of NATS Core. By default, the router uses the generic event plane (NATS Core or ZMQ) with local_indexer mode. Use this flag when you need durability and multi-replica consistency. Requires NATS with JetStream enabled.",
obsolete_flag="--durable-kv-events",
)
add_negatable_bool_argument(
g,
flag_name="--router-track-active-blocks",
env_var="DYN_ROUTER_TRACK_ACTIVE_BLOCKS",
default=True,
help="KV Router: Track active blocks for load balancing. Use --no-router-track-active-blocks to disable",
obsolete_flag="--track-active-blocks",
)
add_negatable_bool_argument(
g,
flag_name="--router-assume-kv-reuse",
env_var="DYN_ROUTER_ASSUME_KV_REUSE",
default=True,
help="KV Router: When tracking active blocks, assume KV cache reuse. Use --no-router-assume-kv-reuse to use random hashes, useful when KV cache reuse is not expected.",
obsolete_flag="--assume-kv-reuse",
)
add_negatable_bool_argument(
g,
flag_name="--router-track-output-blocks",
env_var="DYN_ROUTER_TRACK_OUTPUT_BLOCKS",
default=False,
help="KV Router: Track output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward expected output sequence length (agent_hints.osl in nvext).",
obsolete_flag="--track-output-blocks",
)
add_argument(
g,
flag_name="--router-ttl-secs",
env_var="DYN_ROUTER_TTL_SECS",
default=120.0,
help="KV Router: TTL for blocks in seconds. Only used when --no-router-kv-events is set. Controls how long cached blocks are considered valid without explicit events.",
arg_type=float,
)
add_argument(
g,
flag_name="--router-max-tree-size",
env_var="DYN_ROUTER_MAX_TREE_SIZE",
default=2**20,
help="KV Router: Maximum tree size before pruning. Only used when --no-router-kv-events is set. When the indexer tree exceeds this size, pruning is triggered.",
arg_type=int,
)
add_argument(
g,
flag_name="--router-prune-target-ratio",
env_var="DYN_ROUTER_PRUNE_TARGET_RATIO",
default=0.8,
help="KV Router: Target size ratio after pruning (0.0-1.0). Only used when --no-router-kv-events is set. Determines how aggressively to prune the tree.",
arg_type=float,
)
add_argument(
g,
flag_name="--router-event-threads",
env_var="DYN_ROUTER_EVENT_THREADS",
default=1,
help="KV Router: Number of event processing threads. >1 uses concurrent radix tree and thread pool for higher throughput.",
@@ -49,11 +49,11 @@ A request with `latency_sensitivity: 5.0` arriving at time `T` is treated as if
...
@@ -49,11 +49,11 @@ A request with `latency_sensitivity: 5.0` arriving at time `T` is treated as if
Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:
Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:
1.**Output block tracking**: When `--track-output-blocks` is enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress toward `osl`. This gives the router a more accurate picture of each worker's KV cache utilization for long-running requests.
1.**Output block tracking**: When output block tracking is enabled (frontend: `--track-output-blocks`; standalone router: `--router-track-output-blocks`), the router adds placeholder blocks during generation and applies fractional decay based on progress toward `osl`. This gives the router a more accurate picture of each worker's KV cache utilization for long-running requests.
2.**Resource estimation**: Helps the router estimate total resource requirements when making routing decisions.
2.**Resource estimation**: Helps the router estimate total resource requirements when making routing decisions.
-**Type**: `u32` (optional)
-**Type**: `u32` (optional)
-**Requires**: `--track-output-blocks` for output block tracking behavior
-**Requires**: `--track-output-blocks`(frontend) or `--router-track-output-blocks` (standalone router) for output block tracking behavior
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).