Commit 858f33fc authored by jh-nv, committed by GitHub

feat: migrate router configuration (#6346)

parent af5ace66
...@@ -116,6 +116,7 @@ def add_negatable_bool_argument( ...@@ -116,6 +116,7 @@ def add_negatable_bool_argument(
default: bool, default: bool,
help: str, help: str,
dest: Optional[str] = None, dest: Optional[str] = None,
obsolete_flag: Optional[str] = None,
) -> None: ) -> None:
""" """
Add negatable boolean flag (--foo / --no-foo). Add negatable boolean flag (--foo / --no-foo).
...@@ -126,6 +127,8 @@ def add_negatable_bool_argument( ...@@ -126,6 +127,8 @@ def add_negatable_bool_argument(
env_var: Environment variable name (e.g., "DYN_ENABLE_FEATURE") env_var: Environment variable name (e.g., "DYN_ENABLE_FEATURE")
default: Default value default: Default value
help: Help text help: Help text
dest: Optional destination name for the parsed value
obsolete_flag: Optional obsolete/legacy flag (for help msg only, must start with '--')
""" """
add_argument( add_argument(
parser, parser,
...@@ -134,6 +137,7 @@ def add_negatable_bool_argument( ...@@ -134,6 +137,7 @@ def add_negatable_bool_argument(
default=default, default=default,
help=help, help=help,
dest=dest, dest=dest,
obsolete_flag=obsolete_flag,
arg_type=None, arg_type=None,
action=argparse.BooleanOptionalAction, action=argparse.BooleanOptionalAction,
) )
......
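For readers unfamiliar with `argparse.BooleanOptionalAction` (Python 3.9+), the following minimal sketch shows how one registration yields both `--foo` and `--no-foo`. It uses plain `argparse` rather than Dynamo's `add_negatable_bool_argument` helper; `build_parser` is a hypothetical name, and the flag is borrowed from the router options purely for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-alone parser; this is not Dynamo's real helper.
    parser = argparse.ArgumentParser(prog="negatable-demo")
    parser.add_argument(
        "--router-track-active-blocks",
        # BooleanOptionalAction also registers --no-router-track-active-blocks.
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Track active blocks for load balancing",
    )
    return parser


parser = build_parser()
default_ns = parser.parse_args([])
enabled_ns = parser.parse_args(["--router-track-active-blocks"])
disabled_ns = parser.parse_args(["--no-router-track-active-blocks"])
```

Both spellings write to the same destination (`router_track_active_blocks`), so downstream code reads a single boolean regardless of which form the user typed.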
...@@ -18,9 +18,9 @@ This component is **fully configurable** and works with any Dynamo backend (vLLM ...@@ -18,9 +18,9 @@ This component is **fully configurable** and works with any Dynamo backend (vLLM
```bash ```bash
python -m dynamo.router \ python -m dynamo.router \
--endpoint dynamo.prefill.generate \ --endpoint dynamo.prefill.generate \
--block-size 64 \ --router-block-size 64 \
--router-reset-states \ --router-reset-states \
--no-track-active-blocks --no-router-track-active-blocks
``` ```
### Arguments ### Arguments
...@@ -29,16 +29,16 @@ python -m dynamo.router \ ...@@ -29,16 +29,16 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`) - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)
**Router Configuration:** **Router Configuration:**
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/pages/components/router/router-guide.md). All router options use the `--router-*` prefix (e.g., `--router-block-size`, `--router-kv-overlap-score-weight`, `--router-temperature`, `--router-kv-events` / `--no-router-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, `--router-track-active-blocks` / `--no-router-track-active-blocks`). Legacy names without the prefix (e.g., `--block-size`, `--kv-events`) are still accepted but deprecated. For detailed descriptions, see the [Router Guide](/docs/pages/components/router/router-guide.md).
## Architecture ## Architecture
The standalone router exposes two endpoints via the Dynamo runtime: The standalone router exposes two endpoints via the Dynamo runtime:
1. **`find_best_worker`**: Given a request with token IDs, returns the best worker to handle it 1. **`generate`**: Routes requests to the best worker and streams back generation results (KV-aware routing).
2. **`free`**: Cleans up router state when a request completes 2. **`best_worker_id`**: Given token IDs, returns the best worker ID for the request without routing; useful for debugging or custom routing logic.
Clients query the `find_best_worker` endpoint to determine which worker should process each request, then call the selected worker directly. Clients call the `generate` endpoint to stream completions, or call `best_worker_id` to decide which worker to use and then contact that worker directly.
## Example: Manual Disaggregated Serving (Alternative Setup) ## Example: Manual Disaggregated Serving (Alternative Setup)
...@@ -59,9 +59,9 @@ python -m dynamo.frontend \ ...@@ -59,9 +59,9 @@ python -m dynamo.frontend \
# Start standalone router for prefill workers # Start standalone router for prefill workers
python -m dynamo.router \ python -m dynamo.router \
--endpoint dynamo.prefill.generate \ --endpoint dynamo.prefill.generate \
--block-size 64 \ --router-block-size 64 \
--router-reset-states \ --router-reset-states \
--no-track-active-blocks --no-router-track-active-blocks
# Start decode workers # Start decode workers
python -m dynamo.vllm --model MODEL_NAME --block-size 64 & python -m dynamo.vllm --model MODEL_NAME --block-size 64 &
...@@ -71,10 +71,10 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker & ...@@ -71,10 +71,10 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker &
``` ```
>[!Note] >[!Note]
> **Why `--no-track-active-blocks` for prefill routing?** > **Why `--no-router-track-active-blocks` for prefill routing?**
> Active block tracking is used for load balancing across decode (generation) phases. For prefill-only routing, decode load is not relevant, so disabling this reduces overhead and simplifies the router state. > Active block tracking is used for load balancing across decode (generation) phases. For prefill-only routing, decode load is not relevant, so disabling this reduces overhead and simplifies the router state.
> >
> **Why `--block-size` is required for standalone routers:** > **Why `--router-block-size` is required for standalone routers:**
> Unlike the frontend router which can infer block size from the ModelDeploymentCard (MDC) during worker registration, standalone routers cannot access the MDC and must have the block size explicitly specified. This is a work in progress to enable automatic inference. > Unlike the frontend router which can infer block size from the ModelDeploymentCard (MDC) during worker registration, standalone routers cannot access the MDC and must have the block size explicitly specified. This is a work in progress to enable automatic inference.
## Configuration Best Practices ## Configuration Best Practices
...@@ -82,8 +82,8 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker & ...@@ -82,8 +82,8 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker &
>[!Note] >[!Note]
> **Block Size Matching:** > **Block Size Matching:**
> The block size must match across: > The block size must match across:
> - Standalone router (`--block-size`) > - Standalone router (`--router-block-size`)
> - All worker instances (`--block-size`) > - All worker instances (backend-specific, e.g. `--block-size` for vLLM)
> >
> **Endpoint Matching:** > **Endpoint Matching:**
> The `--endpoint` argument must match where your target workers register. For example: > The `--endpoint` argument must match where your target workers register. For example:
...@@ -95,9 +95,9 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker & ...@@ -95,9 +95,9 @@ python -m dynamo.vllm --model MODEL_NAME --block-size 64 --is-prefill-worker &
To integrate the standalone router with a backend: To integrate the standalone router with a backend:
1. Clients should query the `router.find_best_worker` endpoint before sending requests 1. Workers should register at the endpoint specified by the `--endpoint` argument
2. Workers should register at the endpoint specified by the `--endpoint` argument 2. Clients call the `router.generate` endpoint to stream completions (router selects the best worker), or call `router.best_worker_id` to get the best worker ID and then send requests to that worker
3. Clients should call the `router.free` endpoint when requests complete 3. Router state is updated automatically as requests are routed; no separate "free" call is required
See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a reference implementation (search for `prefill_router_client`). See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a reference implementation (search for `prefill_router_client`).
......
...@@ -12,15 +12,16 @@ to prefill workers) or any other scenario requiring intelligent KV cache-aware ...@@ -12,15 +12,16 @@ to prefill workers) or any other scenario requiring intelligent KV cache-aware
routing decisions. routing decisions.
""" """
import argparse
import asyncio import asyncio
import logging import logging
import os
from typing import Optional from typing import Optional
import uvloop import uvloop
from dynamo.llm import KvRouter, KvRouterConfig from dynamo.llm import KvRouter, KvRouterConfig
from dynamo.router.args import build_kv_router_config
from dynamo.router.args import parse_args as parse_router_args
from dynamo.router.backend_args import DynamoRouterConfig
from dynamo.runtime import Client, DistributedRuntime, dynamo_worker from dynamo.runtime import Client, DistributedRuntime, dynamo_worker
from dynamo.runtime.logging import configure_dynamo_logging from dynamo.runtime.logging import configure_dynamo_logging
...@@ -151,192 +152,42 @@ class StandaloneRouterHandler: ...@@ -151,192 +152,42 @@ class StandaloneRouterHandler:
yield worker_id yield worker_id
def parse_args(): def parse_args(argv=None) -> DynamoRouterConfig:
parser = argparse.ArgumentParser( """Parse router CLI arguments (compatibility shim delegating to args.parse_args)."""
description="Dynamo Standalone Router Service: Configurable KV-aware routing for any worker endpoint", return parse_router_args(argv)
formatter_class=argparse.RawTextHelpFormatter,
)
parser.add_argument(
"--endpoint",
type=str,
required=True,
help=(
"Full endpoint path for workers in the format namespace.component.endpoint\n"
"(e.g., dynamo.prefill.generate for prefill workers)"
),
)
parser.add_argument(
"--block-size",
type=int,
default=128,
help="KV cache block size for routing decisions (default: 128)",
)
parser.add_argument(
"--kv-overlap-score-weight",
type=float,
default=1.0,
help="KV Router: Weight for overlap score in worker selection. Higher values prioritize KV cache reuse (default: 1.0)",
)
parser.add_argument(
"--router-temperature",
type=float,
default=0.0,
help="KV Router: Temperature for worker sampling via softmax. Higher values promote more randomness, and 0 fallbacks to deterministic (default: 0.0)",
)
parser.add_argument(
"--no-kv-events",
action="store_false",
dest="use_kv_events",
default=True,
help="KV Router: Disable KV events. When set, the router predicts cache state based on routing decisions with TTL-based expiration and pruning, rather than receiving events from workers. By default, KV events are enabled.",
)
parser.add_argument(
"--router-replica-sync",
action="store_true",
default=False,
help="KV Router: Enable replica synchronization across multiple router instances. When true, routers will publish and subscribe to events to maintain consistent state (default: False)",
)
parser.add_argument(
"--router-snapshot-threshold",
type=int,
default=1000000,
help="KV Router: Number of messages in stream before triggering a snapshot (default: 1000000)",
)
parser.add_argument(
"--router-reset-states",
action="store_true",
dest="router_reset_states",
default=False,
help="KV Router: Reset router state on startup, purging stream and object store. By default, states are persisted. WARNING: This can affect existing router replicas (default: False)",
)
parser.add_argument(
"--durable-kv-events",
action="store_true",
dest="durable_kv_events",
default=False,
help="KV Router: Enable durable KV events using NATS JetStream instead of NATS Core. By default, the router uses the generic event plane (NATS Core or ZMQ) with local_indexer mode. Use this flag when you need durability and multi-replica consistency. Requires NATS with JetStream enabled.",
)
parser.add_argument(
"--no-track-active-blocks",
action="store_false",
dest="router_track_active_blocks",
default=True,
help="KV Router: Disable tracking of active blocks (blocks being used for ongoing generation). By default, active blocks are tracked for load balancing (default: True)",
)
parser.add_argument(
"--no-assume-kv-reuse",
action="store_false",
dest="router_assume_kv_reuse",
default=True,
help="KV Router: When tracking active blocks, do not assume KV cache reuse (generate random hashes instead of computing actual block hashes). Useful when KV cache reuse is not expected. By default, KV cache reuse is assumed.",
)
parser.add_argument(
"--track-output-blocks",
action="store_true",
dest="router_track_output_blocks",
default=False,
help="KV Router: Track output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward expected output sequence length (agent_hints.osl in nvext). Default: False.",
)
parser.add_argument(
"--router-ttl-secs",
type=float,
default=120.0,
help="KV Router: TTL for blocks in seconds. Only used when --no-kv-events is set. Controls how long cached blocks are considered valid without explicit events (default: 120.0)",
)
parser.add_argument(
"--router-max-tree-size",
type=int,
default=2**20,
help="KV Router: Maximum tree size before pruning. Only used when --no-kv-events is set. When the indexer tree exceeds this size, pruning is triggered (default: 1048576, which is 2^20)",
)
parser.add_argument(
"--router-prune-target-ratio",
type=float,
default=0.8,
help="KV Router: Target size ratio after pruning (0.0-1.0). Only used when --no-kv-events is set. Determines how aggressively to prune the tree (default: 0.8)",
)
parser.add_argument(
"--router-event-threads",
type=int,
default=int(os.environ.get("DYN_ROUTER_EVENT_THREADS", "1")),
help="KV Router: Number of event processing threads. When > 1, uses a concurrent radix tree with a thread pool for higher throughput. Can be set via DYN_ROUTER_EVENT_THREADS env var (default: 1).",
)
return parser.parse_args()
@dynamo_worker() @dynamo_worker()
async def worker(runtime: DistributedRuntime): async def worker(runtime: DistributedRuntime):
"""Main worker function for the standalone router service.""" """Main worker function for the standalone router service."""
args = parse_args() config = parse_args()
# Parse endpoint path to get namespace for service registration
endpoint_parts = args.endpoint.split(".")
if len(endpoint_parts) != 3:
raise ValueError(
f"Invalid endpoint path format: {args.endpoint}. "
"Expected format: namespace.component.endpoint"
)
namespace = endpoint_parts[0]
logger.info("Starting Standalone Router Service") logger.info("Starting Standalone Router Service")
logger.debug( logger.debug(
f"Configuration: endpoint={args.endpoint}, block_size={args.block_size}, " f"Configuration: endpoint={config.endpoint}, router_block_size={config.router_block_size}, "
f"overlap_score_weight={args.kv_overlap_score_weight}, " f"overlap_score_weight={config.router_kv_overlap_score_weight}, "
f"router_temperature={args.router_temperature}, " f"router_temperature={config.router_temperature}, "
f"use_kv_events={args.use_kv_events}, " f"router_use_kv_events={config.router_use_kv_events}, "
f"durable_kv_events={args.durable_kv_events}, " f"router_durable_kv_events={config.router_durable_kv_events}, "
f"router_replica_sync={args.router_replica_sync}, " f"router_replica_sync={config.router_replica_sync}, "
f"router_reset_states={args.router_reset_states}, " f"router_reset_states={config.router_reset_states}, "
f"router_track_active_blocks={args.router_track_active_blocks}, " f"router_track_active_blocks={config.router_track_active_blocks}, "
f"router_track_output_blocks={args.router_track_output_blocks}, " f"router_track_output_blocks={config.router_track_output_blocks}, "
f"router_assume_kv_reuse={args.router_assume_kv_reuse}, " f"router_assume_kv_reuse={config.router_assume_kv_reuse}, "
f"router_ttl_secs={args.router_ttl_secs}, " f"router_ttl_secs={config.router_ttl_secs}, "
f"router_max_tree_size={args.router_max_tree_size}, " f"router_max_tree_size={config.router_max_tree_size}, "
f"router_prune_target_ratio={args.router_prune_target_ratio}" f"router_prune_target_ratio={config.router_prune_target_ratio}"
) )
# Create KvRouter configuration kv_router_config = build_kv_router_config(config)
kv_router_config = KvRouterConfig(
overlap_score_weight=args.kv_overlap_score_weight,
router_temperature=args.router_temperature,
use_kv_events=args.use_kv_events,
durable_kv_events=args.durable_kv_events,
router_replica_sync=args.router_replica_sync,
router_track_active_blocks=args.router_track_active_blocks,
router_track_output_blocks=args.router_track_output_blocks,
router_assume_kv_reuse=args.router_assume_kv_reuse,
router_snapshot_threshold=args.router_snapshot_threshold,
router_reset_states=args.router_reset_states,
router_ttl_secs=args.router_ttl_secs,
router_max_tree_size=args.router_max_tree_size,
router_prune_target_ratio=args.router_prune_target_ratio,
router_event_threads=args.router_event_threads,
)
# Create service component - use "router" as component name # Create service component - use "router" as component name
component = runtime.namespace(namespace).component("router") component = runtime.namespace(config.namespace).component("router")
# Create handler # Create handler
handler = StandaloneRouterHandler( handler = StandaloneRouterHandler(
runtime, args.endpoint, args.block_size, kv_router_config runtime, config.endpoint, config.router_block_size, kv_router_config
) )
await handler.initialize() await handler.initialize()
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Router CLI parsing and config assembly."""
import argparse
from dynamo.llm import KvRouterConfig
from .backend_args import DynamoRouterArgGroup, DynamoRouterConfig
def build_kv_router_config(router_config: DynamoRouterConfig) -> KvRouterConfig:
"""Build KvRouterConfig from DynamoRouterConfig.
Maps CLI/config attribute names to KvRouterConfig constructor kwargs.
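Most names map one-to-one. The exceptions are router_kv_overlap_score_weight -> overlap_score_weight, router_use_kv_events -> use_kv_events, and router_durable_kv_events -> durable_kv_events.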
"""
return KvRouterConfig(
overlap_score_weight=router_config.router_kv_overlap_score_weight,
router_temperature=router_config.router_temperature,
use_kv_events=router_config.router_use_kv_events,
durable_kv_events=router_config.router_durable_kv_events,
router_replica_sync=router_config.router_replica_sync,
router_track_active_blocks=router_config.router_track_active_blocks,
router_track_output_blocks=router_config.router_track_output_blocks,
router_assume_kv_reuse=router_config.router_assume_kv_reuse,
router_snapshot_threshold=router_config.router_snapshot_threshold,
router_reset_states=router_config.router_reset_states,
router_ttl_secs=router_config.router_ttl_secs,
router_max_tree_size=router_config.router_max_tree_size,
router_prune_target_ratio=router_config.router_prune_target_ratio,
router_event_threads=router_config.router_event_threads,
)
def parse_args(argv=None) -> DynamoRouterConfig:
"""Parse command-line arguments for the standalone router.
Returns:
DynamoRouterConfig: Parsed and validated configuration.
"""
parser = argparse.ArgumentParser(
description="Dynamo Standalone Router Service: Configurable KV-aware routing for any worker endpoint",
formatter_class=argparse.RawTextHelpFormatter,
)
group = DynamoRouterArgGroup()
group.add_arguments(parser)
args = parser.parse_args(argv)
config = DynamoRouterConfig.from_cli_args(args)
config.validate()
return config
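The parse, convert, validate flow above can be sketched in isolation as follows. This is a simplified illustration, not the Dynamo implementation: `RouterConfigSketch` and `parse_args_sketch` are hypothetical names, a plain dataclass stands in for `ConfigBase`, and only two of the router options are modeled.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RouterConfigSketch:
    """Hypothetical stand-in for DynamoRouterConfig with only two options."""

    endpoint: Optional[str]
    router_block_size: int
    namespace: str = ""

    def validate(self) -> None:
        # Mirrors the endpoint check in the diff: exactly three dot-separated parts.
        if not self.endpoint:
            raise ValueError("endpoint is required")
        parts = self.endpoint.split(".")
        if len(parts) != 3:
            raise ValueError(
                f"Invalid endpoint format: {self.endpoint!r}. "
                "Expected format: namespace.component.endpoint"
            )
        self.namespace = parts[0]


def parse_args_sketch(argv: Optional[Sequence[str]] = None) -> RouterConfigSketch:
    # Accepting argv (instead of always reading sys.argv) keeps this testable.
    parser = argparse.ArgumentParser(prog="router-sketch")
    parser.add_argument("--endpoint", type=str, default=None)
    parser.add_argument("--router-block-size", type=int, default=128)
    args = parser.parse_args(argv)
    config = RouterConfigSketch(
        endpoint=args.endpoint,
        router_block_size=args.router_block_size,
    )
    config.validate()
    return config


config = parse_args_sketch(
    ["--endpoint", "dynamo.prefill.generate", "--router-block-size", "64"]
)
```

Returning a validated, typed object means the caller can use `config.namespace` directly instead of re-splitting the endpoint string, which is what the worker function did before this change.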
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Dynamo standalone router configuration ArgGroup."""
from dynamo.common.configuration.arg_group import ArgGroup
from dynamo.common.configuration.config_base import ConfigBase
from dynamo.common.configuration.utils import add_argument, add_negatable_bool_argument
class DynamoRouterConfig(ConfigBase):
"""Typed configuration for the standalone KV router (router-owned options only)."""
namespace: str
endpoint: str
router_block_size: int
router_kv_overlap_score_weight: float
router_temperature: float
router_use_kv_events: bool
router_replica_sync: bool
router_snapshot_threshold: int
router_reset_states: bool
router_durable_kv_events: bool
router_track_active_blocks: bool
router_assume_kv_reuse: bool
router_track_output_blocks: bool
router_ttl_secs: float
router_max_tree_size: int
router_prune_target_ratio: float
router_event_threads: int
def validate(self) -> None:
"""Validate config invariants (aligned with Rust KvRouterConfig where applicable)."""
if not self.endpoint:
raise ValueError(
"endpoint is required (set --endpoint or DYN_ROUTER_ENDPOINT)"
)
parts = self.endpoint.split(".")
if len(parts) != 3:
raise ValueError(
f"Invalid endpoint format: {self.endpoint!r}. "
"Expected format: namespace.component.endpoint"
)
self.namespace = parts[0]
class DynamoRouterArgGroup(ArgGroup):
"""CLI argument group for standalone router options."""
name = "dynamo-router"
def add_arguments(self, parser) -> None:
"""Add router-owned arguments to parser."""
g = parser.add_argument_group("Dynamo Router Options")
add_argument(
g,
flag_name="--endpoint",
env_var="DYN_ROUTER_ENDPOINT",
default=None,
help="Full endpoint path for workers in the format namespace.component.endpoint (e.g., dynamo.prefill.generate for prefill workers)",
arg_type=str,
)
add_argument(
g,
flag_name="--router-block-size",
env_var="DYN_ROUTER_BLOCK_SIZE",
default=128,
help="KV cache block size for routing decisions",
arg_type=int,
obsolete_flag="--block-size",
)
add_argument(
g,
flag_name="--router-kv-overlap-score-weight",
env_var="DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT",
default=1.0,
help="KV Router: Weight for overlap score in worker selection. Higher values prioritize KV cache reuse",
arg_type=float,
obsolete_flag="--kv-overlap-score-weight",
)
add_argument(
g,
flag_name="--router-temperature",
env_var="DYN_ROUTER_TEMPERATURE",
default=0.0,
help="KV Router: Temperature for worker sampling via softmax. Higher values promote more randomness, and 0 falls back to deterministic selection.",
arg_type=float,
)
add_negatable_bool_argument(
g,
flag_name="--router-kv-events",
env_var="DYN_ROUTER_USE_KV_EVENTS",
default=True,
help="KV Router: Enable KV events from workers. When disabled (--no-router-kv-events), the router predicts cache state based on routing decisions with TTL-based expiration and pruning, rather than receiving events from workers.",
dest="router_use_kv_events",
obsolete_flag="--kv-events",
)
add_negatable_bool_argument(
g,
flag_name="--router-replica-sync",
env_var="DYN_ROUTER_REPLICA_SYNC",
default=False,
help="KV Router: Enable replica synchronization across multiple router instances. When true, routers will publish and subscribe to events to maintain consistent state.",
)
add_argument(
g,
flag_name="--router-snapshot-threshold",
env_var="DYN_ROUTER_SNAPSHOT_THRESHOLD",
default=1000000,
help="KV Router: Number of messages in stream before triggering a snapshot",
arg_type=int,
)
add_negatable_bool_argument(
g,
flag_name="--router-reset-states",
env_var="DYN_ROUTER_RESET_STATES",
default=False,
help="KV Router: Reset router state on startup, purging stream and object store. WARNING: Can affect existing router replicas.",
)
add_negatable_bool_argument(
g,
flag_name="--router-durable-kv-events",
env_var="DYN_ROUTER_DURABLE_KV_EVENTS",
default=False,
help="KV Router: Enable durable KV events using NATS JetStream instead of NATS Core. By default, the router uses the generic event plane (NATS Core or ZMQ) with local_indexer mode. Use this flag when you need durability and multi-replica consistency. Requires NATS with JetStream enabled.",
obsolete_flag="--durable-kv-events",
)
add_negatable_bool_argument(
g,
flag_name="--router-track-active-blocks",
env_var="DYN_ROUTER_TRACK_ACTIVE_BLOCKS",
default=True,
help="KV Router: Track active blocks for load balancing. Use --no-router-track-active-blocks to disable",
obsolete_flag="--track-active-blocks",
)
add_negatable_bool_argument(
g,
flag_name="--router-assume-kv-reuse",
env_var="DYN_ROUTER_ASSUME_KV_REUSE",
default=True,
help="KV Router: When tracking active blocks, assume KV cache reuse. Use --no-router-assume-kv-reuse to use random hashes, useful when KV cache reuse is not expected.",
obsolete_flag="--assume-kv-reuse",
)
add_negatable_bool_argument(
g,
flag_name="--router-track-output-blocks",
env_var="DYN_ROUTER_TRACK_OUTPUT_BLOCKS",
default=False,
help="KV Router: Track output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward expected output sequence length (agent_hints.osl in nvext).",
obsolete_flag="--track-output-blocks",
)
add_argument(
g,
flag_name="--router-ttl-secs",
env_var="DYN_ROUTER_TTL_SECS",
default=120.0,
help="KV Router: TTL for blocks in seconds. Only used when --no-router-kv-events is set. Controls how long cached blocks are considered valid without explicit events.",
arg_type=float,
)
add_argument(
g,
flag_name="--router-max-tree-size",
env_var="DYN_ROUTER_MAX_TREE_SIZE",
default=2**20,
help="KV Router: Maximum tree size before pruning. Only used when --no-router-kv-events is set. When the indexer tree exceeds this size, pruning is triggered.",
arg_type=int,
)
add_argument(
g,
flag_name="--router-prune-target-ratio",
env_var="DYN_ROUTER_PRUNE_TARGET_RATIO",
default=0.8,
help="KV Router: Target size ratio after pruning (0.0-1.0). Only used when --no-router-kv-events is set. Determines how aggressively to prune the tree.",
arg_type=float,
)
add_argument(
g,
flag_name="--router-event-threads",
env_var="DYN_ROUTER_EVENT_THREADS",
default=1,
help="KV Router: Number of event processing threads. >1 uses concurrent radix tree and thread pool for higher throughput.",
arg_type=int,
)
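Each option above pairs a flag with an environment variable. How `add_argument` resolves the two is not shown in this diff, so the sketch below rests on an assumption: the environment variable supplies the default and an explicit CLI flag overrides it. The removed inline parser computed the default for `--router-event-threads` in this way. `env_default` and `parse_event_threads` are hypothetical helpers, not part of Dynamo.

```python
import argparse
import os
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def env_default(env_var: str, fallback: T, cast: Callable[[str], T]) -> T:
    """Return the environment value cast to the target type, else the fallback."""
    raw = os.environ.get(env_var)
    return fallback if raw is None else cast(raw)


def parse_event_threads(argv: Optional[Sequence[str]] = None) -> int:
    # Assumed precedence: CLI flag > environment variable > hard-coded default.
    parser = argparse.ArgumentParser(prog="env-default-demo")
    parser.add_argument(
        "--router-event-threads",
        type=int,
        default=env_default("DYN_ROUTER_EVENT_THREADS", 1, int),
    )
    return parser.parse_args(argv).router_event_threads
```

Because the default is evaluated when the parser is built, the environment is read on every call to `parse_event_threads`, not once at import time.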
...@@ -49,11 +49,11 @@ A request with `latency_sensitivity: 5.0` arriving at time `T` is treated as if ...@@ -49,11 +49,11 @@ A request with `latency_sensitivity: 5.0` arriving at time `T` is treated as if
Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways: Expected output sequence length — the estimated number of output tokens the request will generate. The router uses this hint in two ways:
1. **Output block tracking**: When `--track-output-blocks` is enabled, the router adds placeholder blocks during generation and applies fractional decay based on progress toward `osl`. This gives the router a more accurate picture of each worker's KV cache utilization for long-running requests. 1. **Output block tracking**: When output block tracking is enabled (frontend: `--track-output-blocks`; standalone router: `--router-track-output-blocks`), the router adds placeholder blocks during generation and applies fractional decay based on progress toward `osl`. This gives the router a more accurate picture of each worker's KV cache utilization for long-running requests.
2. **Resource estimation**: Helps the router estimate total resource requirements when making routing decisions. 2. **Resource estimation**: Helps the router estimate total resource requirements when making routing decisions.
- **Type**: `u32` (optional) - **Type**: `u32` (optional)
- **Requires**: `--track-output-blocks` for output block tracking behavior - **Requires**: `--track-output-blocks` (frontend) or `--router-track-output-blocks` (standalone router) for output block tracking behavior
### Example ### Example
......
...@@ -310,7 +310,7 @@ await prefill_endpoint.serve_endpoint(prefill_handler.generate) ...@@ -310,7 +310,7 @@ await prefill_endpoint.serve_endpoint(prefill_handler.generate)
``` ```
> [!Note] > [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh). > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
### Request Flow ### Request Flow
......