test: generalize router test infrastructure and expand documentations (#7327)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Commit 33604afa authored Mar 13, 2026 by Keiven C; committed by GitHub Mar 13, 2026
5 changed files
--- a/docs/components/router/router-guide.md
+++ b/docs/components/router/router-guide.md
@@ -10,6 +10,57 @@ subtitle: Enable KV-aware routing using Router for Dynamo deployments
 The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
 This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
+## Deployment Modes
+The Dynamo router can be deployed in several configurations. The table below shows every combination and when to use it:
+| Mode | Command | Routing Logic | KV Events | Topology | Use Case |
+|------|---------|---------------|-----------|----------|----------|
+| **Frontend + Round-Robin** | `python -m dynamo.frontend --router-mode round-robin` | Cycles through workers | None | Aggregated | Simplest baseline; no KV awareness |
+| **Frontend + Random** | `python -m dynamo.frontend --router-mode random` | Random worker selection | None | Aggregated | Stateless load balancing |
+| **Frontend + KV (Aggregated)** | `python -m dynamo.frontend --router-mode kv` | KV cache overlap + load | NATS Core / JetStream / ZMQ / Approx | Aggregated | Production single-pool serving with cache reuse |
+| **Frontend + KV (Disaggregated)** | `python -m dynamo.frontend --router-mode kv` with prefill + decode workers | KV cache overlap + load | NATS Core / JetStream / ZMQ / Approx | Disaggregated (prefill + decode pools) | Separate prefill/decode for large-scale serving |
+| **Frontend + Direct** | `python -m dynamo.frontend --router-mode direct` | Worker ID from request hints | None | Aggregated | External orchestrator (e.g., EPP/GAIE) selects workers |
+| **Standalone Router** | `python -m dynamo.router` | KV cache overlap + load | NATS Core / JetStream / ZMQ | Any | Routing without the HTTP frontend (multi-tier, custom pipelines) |
+### Routing Modes (`--router-mode`)
+| Mode | Value | How Workers Are Selected |
+|------|-------|-------------------------|
+| **Round-Robin** | `round-robin` (default) | Cycles through available workers in order |
+| **Random** | `random` | Selects a random worker for each request |
+| **KV** | `kv` | Evaluates KV cache overlap and decode load per worker; picks lowest cost |
+| **Direct** | `direct` | Reads the target `worker_id` from the request's routing hints; no selection logic |
+### KV Event Transport Modes (within `--router-mode kv`)
+When using KV routing, the router needs to know what each worker has cached. There are four ways to get this information:
+| Event Mode | How to Enable | Description |
+|------------|---------------|-------------|
+| **NATS Core (local indexer)** | Default (no extra flags) | Workers maintain a local indexer; router queries workers on startup and receives events via NATS Core |
+| **JetStream (durable)** | `--router-durable-kv-events` | Events persisted in NATS JetStream; supports snapshots and durable consumers. *Deprecated.* |
+| **ZMQ** | `--event-plane zmq` | Workers publish via ZMQ PUB sockets; standalone indexer aggregates events |
+| **Approximate (no events)** | `--no-router-kv-events` | No events consumed; router predicts cache state from its own routing decisions with TTL-based expiration |
+### Aggregated vs. Disaggregated Topology
+| Topology | Workers | How It Works |
+|----------|---------|--------------|
+| **Aggregated** | Single pool (prefill + decode in one process) | All workers handle the full request lifecycle |
+| **Disaggregated** | Separate prefill and decode pools | Frontend routes to a prefill worker first, then to a decode worker; requires workers registered with `ModelType.Prefill` |
+Disaggregated mode is activated automatically when prefill workers register alongside decode workers. See [Disaggregated Serving](#disaggregated-serving) for details.
+### Frontend-Embedded vs. Standalone Router
+| Deployment | Process | Metrics Port | Use Case |
+|------------|---------|--------------|----------|
+| **Frontend-embedded** | `python -m dynamo.frontend --router-mode kv` | Frontend HTTP port (default 8000) | Standard deployment; router runs inside the frontend process |
+| **Standalone** | `python -m dynamo.router` | `DYN_SYSTEM_PORT` (if set) | Multi-tier architectures, SGLang disagg prefill routing, custom pipelines |
+The standalone router does not include the HTTP frontend (no `/v1/chat/completions` endpoint). It exposes only the `RouterRequestMetrics` via the system status server. See the [Standalone Router README](../../../components/src/dynamo/router/README.md).
 ## Quick Start
 ### Python / CLI Deployment
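
Reviewer sketch (not part of the commit): the four `--router-mode` values described in the added tables can be illustrated with a small, self-contained Python model. The `Worker` fields, the `make_selector` helper, and the cost formula (weighted blocks still to prefill plus active decode blocks) are assumptions made for illustration; the actual router's cost function and data structures may differ.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of one worker as seen by the router."""
    worker_id: int
    active_blocks: int  # decode load: blocks currently in use
    cached_blocks: int  # request blocks already present in this worker's KV cache


def select_kv(workers, request_blocks, overlap_weight=1.0):
    """Pick the worker with the lowest assumed cost.

    Assumed cost = overlap_weight * (blocks still to prefill) + active decode blocks.
    """
    def cost(w):
        new_blocks = max(request_blocks - w.cached_blocks, 0)
        return overlap_weight * new_blocks + w.active_blocks

    return min(workers, key=cost).worker_id


def make_selector(mode, workers):
    """Return a callable (request_blocks, hints) -> worker_id for the given mode."""
    if mode == "round-robin":
        cycle = itertools.cycle(w.worker_id for w in workers)
        return lambda request_blocks, hints=None: next(cycle)
    if mode == "random":
        return lambda request_blocks, hints=None: random.choice(workers).worker_id
    if mode == "kv":
        return lambda request_blocks, hints=None: select_kv(workers, request_blocks)
    if mode == "direct":
        # No selection logic: the caller supplies the target worker.
        return lambda request_blocks, hints=None: hints["worker_id"]
    raise ValueError(f"unknown router mode: {mode}")


workers = [
    Worker(0, active_blocks=8, cached_blocks=0),
    Worker(1, active_blocks=10, cached_blocks=6),
]

if __name__ == "__main__":
    for mode in ("round-robin", "random", "kv", "direct"):
        select = make_selector(mode, workers)
        print(mode, select(8, {"worker_id": 1}))
```

In this toy setup the `kv` selector prefers the busier worker when its cache overlap saves more prefill work than its extra decode load costs, which is the trade-off the guide describes.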

--- a/docs/observability/metrics.md
+++ b/docs/observability/metrics.md
@@ -223,7 +223,29 @@ Suppose the backend allows 3 concurrent requests and there are 10 clients contin
 The router exposes metrics for monitoring routing decisions and overhead. Defined in `lib/llm/src/kv_router/metrics.rs`.
-For router configuration and tuning, see the [Router Guide](../components/router/router-guide.md).
+For router configuration, deployment modes, and tuning, see the [Router Guide](../components/router/router-guide.md).
+#### Metrics Availability by Configuration
+Not all metrics appear in every deployment. The table below shows which metric groups are **registered** and **populated** in each configuration:
+| Metric Group | Frontend + KV (agg) | Frontend + KV (disagg) | Frontend + non-KV (round-robin/random/direct) | Standalone Router |
+|---|---|---|---|---|
+| `dynamo_component_router_*` (request metrics) | Registered and populated | Registered and populated | Registered, **always zero** | Populated (on `DYN_SYSTEM_PORT`) |
+| `dynamo_router_overhead_*` (routing overhead) | Registered and populated | Registered and populated | **Not registered** | **Not created** |
+| `dynamo_frontend_router_queue_*` (queue depth) | Registered; populated when `--router-queue-threshold` set | Registered; populated when `--router-queue-threshold` set | **Not registered** | **Not created** |
+| `dynamo_component_kv_cache_events_applied` (indexer) | Populated when KV events are received | Populated when KV events are received | **Not registered** | Populated when KV events are received |
+| `dynamo_frontend_worker_*` (per-worker load/timing) | Registered and populated | Registered and populated (`worker_type`=`prefill`/`decode`) | Registered and populated (`worker_type`=`decode`) | **Not created** |
+**Key:**
+- **Registered and populated**: Metric appears at `/metrics` with real values
+- **Registered, always zero**: Metric appears at `/metrics` but the counter/histogram is never incremented (useful for dashboards that expect the metric to exist)
+- **Not registered / Not created**: Metric does not appear at `/metrics` at all
+**Scrape endpoints:**
+- Frontend: `/metrics` on HTTP port (default 8000, configurable via `--http-port` or `DYN_HTTP_PORT`)
+- Standalone router: `/metrics` on `DYN_SYSTEM_PORT` (must be set explicitly; default is `-1` / disabled)
+- Backend workers: `/metrics` on `DYN_SYSTEM_PORT` (separate from frontend metrics)
 #### Router Request Metrics (`dynamo_component_router_*`)
@@ -242,7 +264,7 @@ All metrics carry the standard hierarchy labels (`dynamo_namespace`, `dynamo_com
 #### Per-Request Routing Overhead (`dynamo_router_overhead_*`)
-Histograms (in milliseconds) tracking the time spent in each phase of the routing decision for every request. Registered on the frontend port (default 8000) at `/metrics` with a `router_id` label (the frontend's discovery instance ID).
+Histograms (in milliseconds) tracking the time spent in each phase of the routing decision for every request. Registered on the frontend port (default 8000) at `/metrics` with a `router_id` label (the frontend's discovery instance ID). These metrics are only created when the frontend has DRT discovery enabled (i.e., `--router-mode kv`); they do not appear in non-KV modes or on the standalone router.
 | Metric | Type | Description |
 |--------|------|-------------|
@@ -252,6 +274,16 @@ Histograms (in milliseconds) tracking the time spent in each phase of the routin
 | `dynamo_router_overhead_scheduling_ms` | Histogram | Time in scheduler worker selection |
 | `dynamo_router_overhead_total_ms` | Histogram | Total routing overhead per request |
+#### Router Queue Metrics (`dynamo_frontend_router_queue_*`)
+Gauge tracking the number of requests pending in the router's scheduler queue. Only registered when `--router-queue-threshold` is set. Labeled by `worker_type` to distinguish prefill vs. decode queues in disaggregated mode.
+| Metric | Type | Description |
+|--------|------|-------------|
+| `dynamo_frontend_router_queue_pending_requests` | Gauge | Requests pending in the router scheduler queue |
+**Labels:** `worker_type` (`prefill` or `decode`)
 #### KV Indexer Metrics
 Tracks KV cache events applied to the router's radix tree index. Only appears when `--router-kv-overlap-score-weight` is greater than 0 (default) and workers are publishing KV events. Will not appear if `--router-kv-overlap-score-weight 0` is set or no KV events have been received.
@@ -260,11 +292,11 @@ Tracks KV cache events applied to the router's radix tree index. Only appears wh
 |--------|------|-------------|
 | `dynamo_component_kv_cache_events_applied` | Counter | KV cache events applied to the index |
-**Additional labels:** `status` (`ok` / `error`), `event_type` (`stored` / `removed` / `cleared`)
+**Additional labels:** `status` (`ok` / `parent_block_not_found` / `block_not_found` / `invalid_block`), `event_type` (`stored` / `removed` / `cleared`)
 #### Per-Worker Load and Timing Gauges (`dynamo_frontend_worker_*`)
-These appear once workers register and begin serving requests. They are registered on the frontend's local Prometheus registry (not component-scoped) and do not carry `dynamo_namespace` or `dynamo_component` labels.
+These appear once workers register and begin serving requests. They are registered on the frontend's local Prometheus registry (not component-scoped) and do not carry `dynamo_namespace` or `dynamo_component` labels. These metrics are frontend-only and are not available on the standalone router.
 | Metric | Type | Description |
 |--------|------|-------------|

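Reviewer sketch (not part of the commit): one way to check the availability matrix above against a live deployment is to parse the text returned by `/metrics` and report which metric groups are present. The helper names `metric_names` and `groups_present` are hypothetical, and the sample text is hand-written for illustration using metric names from the documentation; it is not captured output.

```python
# Prefixes taken from the metric group names in the documentation above.
METRIC_GROUPS = {
    "request": "dynamo_component_router_",
    "overhead": "dynamo_router_overhead_",
    "queue": "dynamo_frontend_router_queue_",
    "indexer": "dynamo_component_kv_cache_events_applied",
    "worker": "dynamo_frontend_worker_",
}


def metric_names(exposition_text):
    """Collect sample names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def groups_present(exposition_text):
    """Map each metric group to True if at least one of its samples appears."""
    names = metric_names(exposition_text)
    return {
        group: any(name.startswith(prefix) for name in names)
        for group, prefix in METRIC_GROUPS.items()
    }


# Hand-written sample for illustration only.
SAMPLE = """\
# TYPE dynamo_router_overhead_total_ms histogram
dynamo_router_overhead_total_ms_bucket{le="1"} 3
dynamo_frontend_router_queue_pending_requests{worker_type="decode"} 0
dynamo_component_kv_cache_events_applied{status="ok",event_type="stored"} 12
"""

if __name__ == "__main__":
    print(groups_present(SAMPLE))
```

Against a running frontend, the same function could be fed the body fetched from `/metrics` on the HTTP port (default 8000), or from `DYN_SYSTEM_PORT` for the standalone router.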
--- a/tests/router/common.py
+++ b/tests/router/common.py
@@ -23,7 +23,7 @@ from tests.router.helper import (
    wait_for_frontend_ready,
    wait_for_workers_ready,
 )
-from tests.router.router_process import KVRouterProcess
+from tests.router.router_process import FrontendRouterProcess, KVRouterProcess
 if TYPE_CHECKING:
    from tests.conftest import NatsServer
@@ -46,6 +46,8 @@ def _test_router_basic(
    frontend_timeout: int = 120,
    store_backend: str = "etcd",
    request_plane: str = "nats",
+    router_mode: str = "kv",
+    enforce_disagg: bool = False,
 ):
    """Basic router test: start router, wait for workers and send concurrent requests via HTTP frontend.
@@ -54,6 +56,9 @@ def _test_router_basic(
    This is a shared test implementation for both mocker and vLLM workers.
    Always waits for workers to be properly registered before sending requests to avoid flakiness.
+    Supports any router_mode (defaults to "kv" for existing callers).
+    block_size is only sent to the frontend CLI when router_mode is "kv".
    Args:
        engine_workers: Backend worker instance ({MockerProcess, VLLMProcess, TRTLLMProcess}) (already initialized with __enter__())
        block_size: Block size for KV cache
@@ -64,21 +69,27 @@ def _test_router_basic(
        frontend_timeout: Timeout for frontend readiness check (default: 120s)
        store_backend: Storage backend to use ("etcd" or "file"). Defaults to "etcd".
        request_plane: Request plane to use ("nats", "tcp", or "http"). Defaults to "nats".
+        router_mode: Router mode ("kv", "round-robin", "random", "direct"). Defaults to "kv".
+        enforce_disagg: Whether to pass --enforce-disagg to the frontend. Defaults to False.
    Raises:
        AssertionError: If requests fail or frontend doesn't become ready
        TimeoutError: If frontend doesn't become ready within timeout
    """
-    with KVRouterProcess(
+    with FrontendRouterProcess(
        request,
        block_size,
        frontend_port,
        engine_workers.namespace,
        store_backend,
+        enforce_disagg=enforce_disagg,
        request_plane=request_plane,
+        router_mode=router_mode,
    ):
-        # Start KV router frontend
-        logger.info(f"Starting KV router frontend on port {frontend_port}")
+        # Start router frontend
+        logger.info(
+            f"Starting frontend --router-mode {router_mode} on port {frontend_port}"
+        )
        frontend_url = f"http://localhost:{frontend_port}"

--- a/tests/router/router_process.py
+++ b/tests/router/router_process.py
@@ -6,8 +6,13 @@ import os
 from tests.utils.managed_process import ManagedProcess
-class KVRouterProcess(ManagedProcess):
-    """Manages the KV router process using dynamo.frontend"""
+class FrontendRouterProcess(ManagedProcess):
+    """Manages a dynamo.frontend process with configurable --router-mode.
+    Supports all router modes (round-robin, random, kv, direct) and all
+    KV-specific options (block size, thresholds, durable events, disagg).
+    block_size is only sent to the CLI when router_mode is "kv".
+    """
    def __init__(
        self,
@@ -22,15 +27,14 @@ class KVRouterProcess(ManagedProcess):
        tokens_threshold_frac: float | None = None,
        request_plane: str = "nats",
        durable_kv_events: bool = False,
+        router_mode: str = "kv",
    ):
        command = [
            "python3",
            "-m",
            "dynamo.frontend",
-            "--kv-cache-block-size",
-            str(block_size),
            "--router-mode",
-            "kv",
+            router_mode,
            "--http-port",
            str(frontend_port),
            "--discovery-backend",
@@ -39,6 +43,9 @@ class KVRouterProcess(ManagedProcess):
            namespace,
        ]
+        if router_mode == "kv":
+            command.extend(["--kv-cache-block-size", str(block_size)])
        if enforce_disagg:
            command.append("--enforce-disagg")
@@ -72,10 +79,16 @@ class KVRouterProcess(ManagedProcess):
            terminate_all_matching_process_names=False,
        )
        self.port = frontend_port
+        self.router_mode = router_mode
    def _check_ready(self, response):
-        """Check if KV router is ready"""
+        """Check if KV, random, round-robin, or direct router is ready"""
        return response.status_code == 200
    def __exit__(self, exc_type, exc_val, exc_tb):
        super().__exit__(exc_type, exc_val, exc_tb)
+# Backward-compatible alias so existing callers that import KVRouterProcess
+# continue to work without changes.
+KVRouterProcess = FrontendRouterProcess
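
Reviewer sketch (not part of the commit): the command construction in `FrontendRouterProcess` can be reduced to a small pure function, which makes the "block size only in kv mode" rule easy to unit-test without spawning a process. `build_frontend_command` is a hypothetical helper, and it covers only the flags visible in the hunks above; the discovery backend and namespace arguments are deliberately omitted.

```python
def build_frontend_command(router_mode, frontend_port, block_size, enforce_disagg=False):
    """Build a dynamo.frontend argv list (illustrative subset of flags)."""
    command = [
        "python3",
        "-m",
        "dynamo.frontend",
        "--router-mode",
        router_mode,
        "--http-port",
        str(frontend_port),
    ]
    # The block size flag is only passed when KV routing is selected.
    if router_mode == "kv":
        command.extend(["--kv-cache-block-size", str(block_size)])
    if enforce_disagg:
        command.append("--enforce-disagg")
    return command


if __name__ == "__main__":
    print(build_frontend_command("kv", 8000, 16))
    print(build_frontend_command("round-robin", 8000, 16))
```

Keeping the argv assembly separate from process management would also let the backward-compatible `KVRouterProcess` alias be exercised with the default `router_mode="kv"` in a fast test.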
--- a/tests/router/test_router_e2e_with_mockers.py
+++ b/tests/router/test_router_e2e_with_mockers.py
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-# Parallelization: Hermetic tests (xdist-safe via dynamic ports + per-test namespaces).
-# Tested on: Linux container.
-# Combined pre_merge wall time (this file):
-# - Serialized: 304.01s.
-# - Parallel (-n auto): 34.55s (269.46s saved, 8.80x).
+# NOTE: These tests run reliably in serial but have encountered intermittent failures
+# under pytest-xdist parallel execution (-n auto). Each test spawns its own
+# DistributedRuntime with isolated etcd/NATS and unique namespaces, but the Rust
+# runtime may use process-global state (e.g. lazy_static / OnceLock singletons for
+# endpoint tables) that races under concurrent xdist workers. Do not add
+# @pytest.mark.parallel until DRT endpoint registration is confirmed thread-safe.
 #
 # NOTE: TCP request plane is NOT tested here. These tests use --num-workers > 1 which spawns
 # multiple workers in a single process sharing one TCP server. The shared TCP server uses
@@ -637,25 +638,33 @@ class DisaggMockerProcess:
 @pytest.mark.timeout(120)  # bumped for xdist contention (was 42s; ~13.80s serial avg)
-@pytest.mark.parametrize("request_plane", ["nats", "tcp"], indirect=True)
 @pytest.mark.parametrize(
-    "durable_kv_events", [False], indirect=True
-)  # Use NATS Core (local indexer)
-def test_mocker_kv_router(
+    "router_mode,durable_kv_events",
+    [
+        pytest.param("kv", False, id="kv-nondurable"),
+        pytest.param("kv", True, id="kv-durable"),
+        pytest.param("round-robin", False, id="roundrobin"),
+        pytest.param("random", False, id="random"),
+    ],
+    indirect=["durable_kv_events"],
+)
+@pytest.mark.parametrize("request_plane", ["nats", "tcp"], indirect=True)
+def test_mocker_router(
    request,
    runtime_services_dynamic_ports,
    predownload_tokenizers,
+    router_mode,
    request_plane,
    durable_kv_events,
 ):
-    """
-    Test KV router with multiple mocker engine instances.
-    This test doesn't require GPUs and runs quickly for pre-merge validation.
-    Tests both NATS and TCP request planes.
-    """
+    """Test router with multiple mocker engine instances across all router modes.
+    Covers kv, round-robin, and random routing. Tests both NATS and TCP request planes.
+    """
    # runtime_services starts etcd and optionally nats based on request_plane
-    logger.info(f"Starting mocker KV router test with request_plane={request_plane}")
+    logger.info(
+        f"Starting mocker router test: router_mode={router_mode}, request_plane={request_plane}"
+    )
    # Create mocker args dictionary - use local indexer (NATS Core mode)
    mocker_args = {
@@ -688,12 +697,13 @@ def test_mocker_kv_router(
            test_payload=TEST_PAYLOAD,
            num_requests=NUM_REQUESTS,
            request_plane=request_plane,
+            router_mode=router_mode,
        )
 @pytest.mark.parametrize("store_backend", ["etcd", "file"])
 @pytest.mark.parametrize(
-    "durable_kv_events", [False], indirect=True
+    "durable_kv_events", [False], ids=["nondurable"], indirect=True
 )  # Use NATS Core (local indexer)
 @pytest.mark.timeout(180)  # bumped for xdist contention (was 60s; ~19.86s serial avg)
 def test_mocker_two_kv_router(
@@ -752,7 +762,7 @@ def test_mocker_two_kv_router(
 @pytest.mark.skip(reason="Flaky, temporarily disabled")
 @pytest.mark.parametrize(
-    "durable_kv_events", [False], indirect=True
+    "durable_kv_events", [False], ids=["nondurable"], indirect=True
 )  # Use NATS Core (local indexer)
 @pytest.mark.timeout(60)  # ~3x average (~19.86s), rounded up (when enabled)
 def test_mocker_kv_router_overload_503(
@@ -790,7 +800,7 @@ def test_mocker_kv_router_overload_503(
 @pytest.mark.timeout(90)  # bumped for xdist contention (was 22s; ~7.10s serial avg)
 @pytest.mark.parametrize("request_plane", ["nats", "tcp"], indirect=True)
 @pytest.mark.parametrize(
-    "durable_kv_events", [False], indirect=True
+    "durable_kv_events", [False], ids=["nondurable"], indirect=True
 )  # Use NATS Core (local indexer)
 def test_kv_router_bindings(
    request,
@@ -922,7 +932,7 @@ def test_indexers_sync(
 @pytest.mark.timeout(120)  # bumped for xdist contention (was 42s; ~13.80s serial avg)
 @pytest.mark.parametrize(
-    "durable_kv_events", [False], indirect=True
+    "durable_kv_events", [False], ids=["nondurable"], indirect=True
 )  # Use NATS Core (local indexer)
 def test_query_instance_id_returns_worker_and_tokens(
    request, runtime_services_dynamic_ports, predownload_tokenizers, durable_kv_events
@@ -1155,7 +1165,7 @@ def test_router_decisions_disagg(
 @pytest.mark.parametrize("request_plane", ["nats", "tcp"], indirect=True)
 @pytest.mark.parametrize(
-    "durable_kv_events", [False], indirect=True
+    "durable_kv_events", [False], ids=["nondurable"], indirect=True
 )  # Use NATS Core (local indexer)
 @pytest.mark.timeout(120)  # bumped for xdist contention (was 39s; ~12.84s serial avg)
 def test_busy_threshold_endpoint(