chore: remove outdated router_standalone_trtllm example and add standalone router docs (#7278)

Signed-off-by: akshatha-k <akshutk@gmail.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: akshatha-k <akshutk@gmail.com>

Commit bb43fada (unverified), authored Mar 12, 2026 by dagil-nvidia, committed by GitHub Mar 12, 2026
9 changed files
--- a/docs/components/router/README.md
+++ b/docs/components/router/README.md
@@ -23,6 +23,10 @@ For Kubernetes, set `DYN_ROUTER_MODE=kv` on the Frontend service. Workers automa
 | `--no-router-kv-events` | enabled | Fall back to approximate routing (no event consumption from workers) |
 | `--router-queue-threshold` | disabled | Enable backpressure queue under high concurrency; also enables priority scheduling via `nvext.agent_hints.latency_sensitivity` |
+### Standalone Router
+You can also run the KV router as a standalone service (without the Dynamo frontend). See the [Standalone Router component](../../../components/src/dynamo/router/) for more details.
 For all CLI arguments, environment variables, K8s deployment examples, and tuning guidelines, see the [Router Guide](router-guide.md). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
 ## Prerequisites and Limitations

--- a/docs/components/router/router-guide.md
+++ b/docs/components/router/router-guide.md
@@ -82,6 +82,10 @@ All CLI arguments can be configured via environment variables using the `DYN_` p
 For complete K8s examples and advanced configuration, see [K8s Examples](router-examples.md#k8s-examples).
 For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
+### Standalone Router
+You can also run the KV router as a standalone service (without the Dynamo frontend) for disaggregated serving (e.g., routing to prefill workers), multi-tier architectures, or any scenario requiring intelligent KV cache-aware routing decisions. See the [Standalone Router component](../../../components/src/dynamo/router/) for more details.
 ## KV Cache Routing
 KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
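As a rough illustration of what "workers with the most relevant cached data" can mean, the sketch below chain-hashes fixed-size token blocks and counts how many leading blocks of a request a worker already holds. The hashing scheme and the names `block_hashes` and `matched_prefix_blocks` are assumptions made for this example; they are not the router's actual implementation.

```python
import hashlib


def block_hashes(tokens: list[int], block_size: int) -> list[str]:
    """Chain-hash full token blocks so each hash depends on its whole prefix.

    Hypothetical scheme for illustration only; a trailing partial block is ignored.
    """
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def matched_prefix_blocks(request_hashes: list[str], cached: set[str]) -> int:
    """Count leading request blocks that are present in a worker's cache."""
    count = 0
    for h in request_hashes:
        if h not in cached:
            break
        count += 1
    return count


if __name__ == "__main__":
    request = block_hashes(list(range(8)), block_size=4)
    worker_caches = {
        "worker-a": set(request),      # holds both blocks
        "worker-b": set(request[:1]),  # holds only the first block
    }
    for name, cache in worker_caches.items():
        print(name, matched_prefix_blocks(request, cache))
```

Because each hash folds in its parent, two requests share a block hash only when their entire prefix up to that block is identical, which is what makes a prefix match a reasonable proxy for reusable KV cache.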

--- a/examples/deployments/router_standalone_trtllm/README.md
+++ b/examples/deployments/router_standalone_trtllm/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-https://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-# Router Standalone - TensorRT-LLM
-A standalone implementation of KvRouter that demonstrates usage with TensorRT-LLM workers, without dependency on the dynamo runtime, etcd control plane, or nats event plane.
-## Overview
-This example shows how to use KvRouter with TensorRT-LLM workers to intelligently route requests across multiple GPUs based on KV cache overlap and load metrics. The router maintains a view of each worker's cached blocks and routes new requests to the worker with the best combination of cache overlap and available capacity.
-Key features:
-- **KV cache-aware routing**: Routes requests to workers with matching cached blocks
-- **Multimodal support**: Handles vision-language models (e.g., Qwen2-VL) with image inputs
-- **MM hash routing**: Identical images produce identical hashes for cache reuse
-## How It Works
-### Core Architecture
-The router uses a **RadixTree** data structure (written in Rust) to efficiently track which blocks each worker has cached. When a new request arrives, the router:
-1. Tokenizes the request and computes block hashes (including MM hashes for images)
-2. Uses `find_matches` to calculate overlap scores between the request and each worker's cached blocks
-3. Combines this with current load metrics to select the optimal worker
-4. Routes the request to the chosen worker for processing
-### Multimodal Routing
-For vision-language models:
-1. Images are processed using `default_multimodal_input_loader` from TensorRT-LLM
-2. Image placeholders are expanded to visual tokens using HuggingFace `AutoProcessor`
-3. `apply_mm_hashes` computes a content hash for each image
-4. The MM hash is included in block hash computation, so identical images produce cache hits
-### Event-Driven Updates
-The router receives two types of events from TensorRT-LLM engines:
-1. **KV Events**: Emitted automatically when blocks are stored/removed from cache (includes `mm_keys` for multimodal)
-2. **Load Metrics**: GPU cache usage and waiting request count
-## Components
-### `worker.py`
-- **TrtllmWorkers**: Manages multiple TensorRT-LLM worker processes
-- Each worker runs on a separate GPU with KV cache event emission enabled
-- Publishes metrics and KV events over ZMQ
-- Extracts `mm_hash` from TRTLLM's `mm_keys` field for multimodal routing
-### `router.py`
-- **KvRouter**: Core routing logic using RadixTree
-- Subscribes to KV cache events and load metrics from workers
-- Implements `get_best_worker()` to select optimal routing destination
-### `api.py`
-- **ServiceAPI**: FastAPI server providing OpenAI-compatible chat completions endpoint
-- Handles multimodal inputs (images) via `default_multimodal_input_loader`
-- Computes block hashes including MM hashes for routing decisions
-- Streams responses in OpenAI format
-### `test_router.py`
-- Comprehensive test suite for router functionality
-- Includes local hash computation tests and server-side multimodal tests
-- Run with `--mm-only` for multimodal-specific tests
-## Requirements
-- **TensorRT-LLM >= 1.2.0rc6**: You need TensorRT-LLM version 1.2.0rc6 or later, which includes multimodal information (`mm_keys`) in KV cache events. This is required for MM hash-based routing. See [PR #9604](https://github.com/NVIDIA/TensorRT-LLM/pull/9604) for details.
-- TensorRT-LLM with pytorch backend
-- Multiple GPUs (one per worker)
-- Python 3.10+
-- Required packages: fastapi, uvicorn, httpx, zmq, tensorrt_llm, transformers
-## Usage
-### 1. Start the API Server
-```bash
-python api.py \
-  --model Qwen/Qwen2-VL-2B-Instruct \
-  --num-workers 2 \
-  --block-size 32 \
-  --base-kv-events-port 5557 \
-  --base-metrics-port 5657 \
-  --router-port 7000 \
-  --http-port 8000
-```
-This will:
-- Initialize TensorRT-LLM engines on each GPU
-- Start ZMQ publishers for metrics and KV events
-- Start the router service
-- Start the OpenAI-compatible API server
-### 2. Test with curl
-**Text-only request:**
-```bash
-curl -s http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen2-VL-2B-Instruct",
-    "messages": [{"role": "user", "content": "Hello, how are you?"}],
-    "max_tokens": 100,
-    "stream": false
-  }' | jq
-```
-**Multimodal request (with images):**
-```bash
-curl -s -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen2-VL-2B-Instruct",
-    "messages": [{
-      "role": "user",
-      "content": [
-        {"type": "text", "text": "Describe both images in detail."},
-        {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/Sayali9141/traffic_signal_images/resolve/main/61.jpg"}},
-        {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test2017/000000000001.jpg"}}
-      ]
-    }],
-    "max_tokens": 500,
-    "stream": false
-  }' | jq
-```
-### 3. Run Tests
-```bash
-# Run all tests
-python test_router.py
-# Run multimodal tests only
-python test_router.py --mm-only
-# Verbose output
-python test_router.py -v
-```
-### 4. Check endpoint health
-```bash
-./ping.sh
-```
-## Configuration
-### Command-line Arguments
-- `--model`: HuggingFace model name (default: Qwen/Qwen2-VL-2B-Instruct)
-- `--num-workers`: Number of GPU workers (default: 2)
-- `--block-size`: KV cache block size (default: 32, TensorRT-LLM's default)
-- `--base-kv-events-port`: Base port for KV events ZMQ (default: 5557)
-- `--base-metrics-port`: Base port for metrics ZMQ (default: 5657)
-- `--router-port`: Router HTTP service port (default: 7000)
-- `--http-port`: API server port (default: 8000)
-### Environment Variables
-- `DYNAMO_DEBUG=1`: Enable debug file dumps to `/tmp/debug_*.txt`
-- `LOGLEVEL=DEBUG`: Set logging level (DEBUG, INFO, WARNING, ERROR)
-- `TRANSFORMERS_ATTN_IMPLEMENTATION=eager`: Disable FlashAttention (set automatically)
-- `TRTLLM_MAX_NUM_TOKENS`: Set max token length
-### Port Assignment
-Workers use sequential ports:
-- Worker 0: KV events on 5557, metrics on 5657
-- Worker 1: KV events on 5558, metrics on 5658
-- Worker N: KV events on 5557+N, metrics on 5657+N
-## Architecture Diagram
-```
-┌─────────────┐
-│   Client    │
-└──────┬──────┘
-       │ HTTP
-       ▼
-┌─────────────────┐
-│   API Server    │
-│   (api.py)      │
-└────────┬────────┘
-         │ HTTP
-         ▼
-┌─────────────────┐
-│     Router      │──┐
-│   (router.py)   │  │ ZMQ (KV Events)
-└────────┬────────┘  │
-         │           │
-         │ Select    │
-         │ Worker    │
-         ▼           │
-┌─────────────────┐  │
-│  TrtllmWorkers  │  │
-│   (worker.py)   │◄-┘
-└─────────────────┘
-    │         │
-    ▼         ▼
-  GPU 0     GPU 1
-```
-## Multimodal KV Cache Routing
-When processing multimodal requests:
-1. **API Layer** (`api.py`):
-   - Parses OpenAI-format messages with `image_url` content
-   - Uses `default_multimodal_input_loader` to process images
-   - Expands image placeholders to visual tokens via `AutoProcessor`
-   - Computes `mm_hash` using `apply_mm_hashes`
-   - Includes `mm_hash` in block hash computation for routing
-2. **Worker Layer** (`worker.py`):
-   - Receives multimodal input and passes to TRTLLM
-   - Extracts `mm_hash` from TRTLLM's `mm_keys` in KV events
-   - Publishes KV events with `mm_extra_info` to router
-3. **Router Layer** (`router.py`):
-   - RadixTree matches blocks including MM hash
-   - Same image content = same hash = cache hit on same worker
-## Notes
-- This is a standalone implementation for pedagogical purposes
-- Production dynamo uses NATS for events and etcd for service discovery
-- Each worker needs its own GPU
-- TensorRT-LLM models may take time to compile on first run
-## See Also
-- [TensorRT-LLM KV Event Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/examples/llm_inference_kv_events.html)
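For readers of the removed README, the sequential port scheme under "Port Assignment" (worker N uses base + N for each channel) can be sketched as follows. The helper name `worker_endpoints` and the collision check are assumptions added for illustration; the `tcp://localhost:<port>` form mirrors how the removed `router.py` connects its subscribers.

```python
def worker_endpoints(
    num_workers: int,
    base_kv_events_port: int = 5557,
    base_metrics_port: int = 5657,
) -> list[dict[str, str]]:
    """Return per-worker ZMQ endpoints using the base + N port scheme.

    Hypothetical helper: raises if the two port ranges would overlap,
    which happens once num_workers exceeds the gap between the bases.
    """
    if num_workers > abs(base_metrics_port - base_kv_events_port):
        raise ValueError("KV event and metrics port ranges would collide")
    return [
        {
            "kv_events": f"tcp://localhost:{base_kv_events_port + i}",
            "metrics": f"tcp://localhost:{base_metrics_port + i}",
        }
        for i in range(num_workers)
    ]


if __name__ == "__main__":
    for worker_id, endpoints in enumerate(worker_endpoints(2)):
        print(worker_id, endpoints)
```

With the default bases 100 ports apart, the scheme stays collision-free for up to 100 workers; beyond that, the last KV event port would land on the first metrics port.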
--- a/examples/deployments/router_standalone_trtllm/__init__.py
+++ b/examples/deployments/router_standalone_trtllm/__init__.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
--- a/examples/deployments/router_standalone_trtllm/api.py
+++ b/examples/deployments/router_standalone_trtllm/api.py
--- a/examples/deployments/router_standalone_trtllm/ping.sh
+++ b/examples/deployments/router_standalone_trtllm/ping.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# Simple health check - sends a basic chat request
-# Model name should match what you started api.py with
-curl -s -X POST http://localhost:8000/v1/chat/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-    "model": "Qwen/Qwen2-VL-2B-Instruct",
-    "messages": [{"role": "user", "content": "Hello!"}],
-    "stream": false,
-    "max_tokens": 50
-    }' | jq
\ No newline at end of file
--- a/examples/deployments/router_standalone_trtllm/router.py
+++ b/examples/deployments/router_standalone_trtllm/router.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-import argparse
-import asyncio
-import json
-import logging
-import os
-from contextlib import asynccontextmanager
-import numpy as np
-import uvicorn
-import zmq
-import zmq.asyncio
-from fastapi import FastAPI, HTTPException
-from pydantic import BaseModel, ValidationError
-from dynamo._core import RadixTree
-logger = logging.getLogger(__name__)
-DEBUG_ENABLED = os.environ.get("DYNAMO_DEBUG", "0") == "1"
-def dump_kv_event(worker_id: int, event: dict):
-    """Dump KV event to file for debugging (only when DYNAMO_DEBUG=1)."""
-    if not DEBUG_ENABLED:
-        return
-    import datetime
-    with open("/tmp/debug_kv_events.txt", "a") as f:
-        f.write(f"\n{'='*60}\n")
-        f.write(f"Timestamp: {datetime.datetime.now()}\n")
-        f.write(f"Worker ID: {worker_id}\n")
-        f.write(f"Event: {json.dumps(event, indent=2)}\n")
-# -----------------------------------------------------------------------------
-# Request/Response Models
-# -----------------------------------------------------------------------------
-class RouterRequest(BaseModel):
-    local_hashes: list[int]
-    num_tokens: int
-class RouterResponse(BaseModel):
-    worker_id: int
-    overlap: float = 0.0
-    matched_blocks: int = 0
-class InjectEventRequest(BaseModel):
-    """For testing: inject a KV event directly into RadixTree."""
-    worker_id: int
-    tokens_hash: int
-    block_hash: int | None = None
-    mm_extra_info: dict | None = None
-class LoadMetrics(BaseModel):
-    kv_cache_usage: float
-    num_waiting_reqs: int
-# -----------------------------------------------------------------------------
-# ZMQ Helpers
-# -----------------------------------------------------------------------------
-def create_zmq_subscriber(context: zmq.Context, endpoint: str) -> zmq.Socket[bytes]:
-    """Create a ZMQ SUB socket with standard settings."""
-    socket = context.socket(zmq.SUB)
-    socket.connect(endpoint)
-    socket.setsockopt(zmq.SUBSCRIBE, b"")
-    socket.setsockopt(zmq.CONFLATE, 1)
-    socket.setsockopt(zmq.RCVTIMEO, 1)
-    return socket
-# -----------------------------------------------------------------------------
-# KvRouter Core
-# -----------------------------------------------------------------------------
-class KvRouter:
-    """Router that uses RadixTree for KV cache-aware worker selection."""
-    def __init__(
-        self,
-        block_size: int = 64,
-        num_workers: int = 4,
-        base_kv_events_port: int = 5557,
-        base_metrics_port: int = 5657,
-    ):
-        self.num_workers = num_workers
-        self.block_size = block_size
-        self.radix_tree = RadixTree()
-        # Per-worker metrics
-        self.kv_usages = [0.0] * num_workers
-        self.waitings = [0] * num_workers
-        # ZMQ setup
-        self.context = zmq.Context()
-        self.load_listeners = [
-            create_zmq_subscriber(
-                self.context, f"tcp://localhost:{base_metrics_port + i}"
-            )
-            for i in range(num_workers)
-        ]
-        self.async_context = zmq.asyncio.Context()
-        self.kv_listeners = [
-            self._create_kv_listener(base_kv_events_port + i)
-            for i in range(num_workers)
-        ]
-        self.background_tasks: list[asyncio.Task] = []
-        logger.info("Router initialized")
-    def _create_kv_listener(self, port: int) -> zmq.asyncio.Socket:
-        """Create an async ZMQ SUB socket for receiving KV cache events."""
-        sock = self.async_context.socket(zmq.SUB)
-        sock.connect(f"tcp://localhost:{port}")
-        sock.setsockopt(zmq.SUBSCRIBE, b"")
-        sock.setsockopt(zmq.RCVTIMEO, 1)
-        return sock
-    # -------------------------------------------------------------------------
-    # Background Tasks
-    # -------------------------------------------------------------------------
-    async def start_background_tasks(self):
-        """Start background tasks for load and tree updates."""
-        logger.info("Starting router background tasks...")
-        for worker_id in range(self.num_workers):
-            self.background_tasks.append(
-                asyncio.create_task(self._poll_worker_load(worker_id))
-            )
-            self.background_tasks.append(
-                asyncio.create_task(self._poll_worker_kv_events(worker_id))
-            )
-    async def _poll_worker_load(self, worker_id: int):
-        """Poll load metrics for a single worker."""
-        while True:
-            try:
-                data = self.load_listeners[worker_id].recv_json(zmq.NOBLOCK)
-                metrics = LoadMetrics.model_validate(data)
-                self.kv_usages[worker_id] = metrics.kv_cache_usage
-                self.waitings[worker_id] = metrics.num_waiting_reqs
-            except zmq.Again:
-                pass
-            except (zmq.ZMQError, ValidationError) as e:
-                logger.warning(f"Worker {worker_id} metrics error: {e}")
-            except Exception:
-                logger.exception(f"Worker {worker_id} unexpected metrics error")
-            await asyncio.sleep(0.1)
-    async def _poll_worker_kv_events(self, worker_id: int):
-        """Poll KV events for a single worker and update RadixTree."""
-        sock = self.kv_listeners[worker_id]
-        while True:
-            try:
-                event_bytes = await sock.recv(zmq.NOBLOCK)
-                event = json.loads(event_bytes)
-                dump_kv_event(worker_id, event)
-                self.radix_tree.apply_event(
-                    worker_id, json.dumps(event).encode("utf-8")
-                )
-            except zmq.Again:
-                pass
-            except (zmq.ZMQError, json.JSONDecodeError) as e:
-                logger.warning(f"Worker {worker_id} KV events error: {e}")
-            except Exception:
-                logger.exception(f"Worker {worker_id} unexpected KV events error")
-            await asyncio.sleep(0.1)
-    # -------------------------------------------------------------------------
-    # Worker Selection
-    # -------------------------------------------------------------------------
-    async def get_best_worker(
-        self, local_hashes: list[int], num_tokens: int
-    ) -> tuple[int, float, int]:
-        """
-        Find best worker for request.
-        Returns: (worker_id, overlap_ratio, matched_blocks)
-        """
-        if num_tokens <= 0:
-            raise ValueError("num_tokens must be positive")
-        # Get cache matches from RadixTree
-        matched_blocks = self._get_matched_blocks(local_hashes)
-        # Compute overlap scores
-        overlap_scores = {
-            wid: matched_blocks[wid] * self.block_size / num_tokens
-            for wid in range(self.num_workers)
-        }
-        # Compute routing logits
-        logits = self._compute_logits(overlap_scores)
-        # Select best worker (random tie-breaking)
-        best_id = self._select_best_worker(logits)
-        # Predictive update for burst handling
-        self.waitings[best_id] += 1
-        return best_id, overlap_scores[best_id], matched_blocks[best_id]
-    def _get_matched_blocks(self, local_hashes: list[int]) -> dict[int, int]:
-        """Get matched block count per worker from RadixTree."""
-        result = self.radix_tree.find_matches(local_hashes)
-        raw_scores = result.scores
-        logger.info(f"Router: raw_scores={raw_scores}")
-        # raw_scores is keyed by (worker_id, dp_rank); assume dp_rank=0
-        return {wid: raw_scores.get((wid, 0), 0) for wid in range(self.num_workers)}
-    def _compute_logits(self, overlap_scores: dict[int, float]) -> list[float]:
-        """Compute routing logits for each worker."""
-        max_waiting = max(self.waitings) if self.waitings else 0
-        logits = []
-        for wid in range(self.num_workers):
-            overlap = overlap_scores[wid]
-            usage = self.kv_usages[wid]
-            waiting_norm = self.waitings[wid] / max_waiting if max_waiting else 0.0
-            logit = 2 * overlap - usage - waiting_norm
-            logits.append(logit)
-            logger.info(
-                f"worker_id: {wid}, logit = 2 * {overlap:.3f} - {usage:.3f} - {waiting_norm:.3f} = {logit:.3f}"
-            )
-        return logits
-    def _select_best_worker(self, logits: list[float]) -> int:
-        """Select worker with highest logit (random tie-breaking)."""
-        arr = np.array(logits)
-        return int(np.random.choice(np.flatnonzero(arr == arr.max())))
-    # -------------------------------------------------------------------------
-    # Shutdown
-    # -------------------------------------------------------------------------
-    async def shutdown(self):
-        """Shutdown ZMQ listeners and background tasks."""
-        logger.info("Shutting down KvRouter...")
-        for task in self.background_tasks:
-            task.cancel()
-        if self.background_tasks:
-            await asyncio.gather(*self.background_tasks, return_exceptions=True)
-        for listener in self.load_listeners:
-            listener.close()
-        for listener in self.kv_listeners:
-            listener.close()
-        self.context.term()
-        self.async_context.term()
-        logger.info("KvRouter shutdown completed")
-# -----------------------------------------------------------------------------
-# Router API Server
-# -----------------------------------------------------------------------------
-class RouterAPI:
-    """FastAPI wrapper for KvRouter."""
-    def __init__(
-        self,
-        block_size: int = 64,
-        num_workers: int = 4,
-        base_kv_events_port: int = 5557,
-        base_metrics_port: int = 5657,
-        port: int = 7000,
-    ):
-        self.port = port
-        self.router_config = {
-            "block_size": block_size,
-            "num_workers": num_workers,
-            "base_kv_events_port": base_kv_events_port,
-            "base_metrics_port": base_metrics_port,
-        }
-        self.router: KvRouter | None = None
-        self.app = FastAPI(
-            title="KV Router API", version="0.0.1", lifespan=self.lifespan
-        )
-        self._setup_routes()
-    def _require_router(self) -> KvRouter:
-        """Get router or raise 503 if not initialized."""
-        if self.router is None:
-            raise HTTPException(status_code=503, detail="Router not initialized")
-        return self.router
-    @asynccontextmanager
-    async def lifespan(self, app: FastAPI):
-        self.router = KvRouter(**self.router_config)
-        await self.router.start_background_tasks()
-        logger.info("Router API started")
-        yield
-        if self.router:
-            await self.router.shutdown()
-    def _setup_routes(self):
-        @self.app.post("/find_best_worker", response_model=RouterResponse)
-        async def find_best_worker(request: RouterRequest):
-            router = self._require_router()
-            try:
-                wid, overlap, matched = await router.get_best_worker(
-                    request.local_hashes, request.num_tokens
-                )
-                return RouterResponse(
-                    worker_id=wid, overlap=overlap, matched_blocks=matched
-                )
-            except ValueError as e:
-                raise HTTPException(status_code=400, detail=str(e))
-        @self.app.get("/debug/tree_info")
-        async def get_tree_info():
-            router = self._require_router()
-            events = router.radix_tree.dump_tree_as_events()
-            return {"num_blocks": len(events), "events": events[:20]}
-        @self.app.post("/debug/inject_event")
-        async def inject_event(request: InjectEventRequest):
-            router = self._require_router()
-            block_hash = request.block_hash or request.tokens_hash
-            event = {
-                "event_id": 99999,
-                "data": {
-                    "stored": {
-                        "parent_hash": None,
-                        "blocks": [
-                            {
-                                "block_hash": block_hash,
-                                "tokens_hash": request.tokens_hash,
-                                "mm_extra_info": request.mm_extra_info,
-                            }
-                        ],
-                    }
-                },
-            }
-            router.radix_tree.apply_event(
-                request.worker_id, json.dumps(event).encode("utf-8")
-            )
-            return {
-                "status": "ok",
-                "tokens_hash": request.tokens_hash,
-                "worker_id": request.worker_id,
-            }
-    async def start(self):
-        """Start the router API server."""
-        logger.info(f"Starting Router API on port {self.port}")
-        config = uvicorn.Config(
-            self.app, host="0.0.0.0", port=self.port, log_level="info"
-        )
-        await uvicorn.Server(config).serve()
-def main():
-    parser = argparse.ArgumentParser(description="KV Router API Server")
-    parser.add_argument(
-        "--block-size", type=int, default=32, help="Block size (default: 32)"
-    )
-    parser.add_argument("--num-workers", type=int, default=2, help="Number of workers")
-    parser.add_argument(
-        "--base-kv-events-port", type=int, default=5557, help="Base KV events port"
-    )
-    parser.add_argument(
-        "--base-metrics-port", type=int, default=5657, help="Base metrics port"
-    )
-    parser.add_argument("--port", type=int, default=7000, help="Router API port")
-    args = parser.parse_args()
-    logging.basicConfig(level=logging.INFO)
-    api = RouterAPI(
-        block_size=args.block_size,
-        num_workers=args.num_workers,
-        base_kv_events_port=args.base_kv_events_port,
-        base_metrics_port=args.base_metrics_port,
-        port=args.port,
-    )
-    asyncio.run(api.start())
-if __name__ == "__main__":
-    main()
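The scoring in the removed `router.py` reduces to `logit = 2 * overlap - usage - waiting_norm`, where `overlap = matched_blocks * block_size / num_tokens`, followed by a random choice among the top-scoring workers. The standalone sketch below reproduces that arithmetic with the standard library only (the original used NumPy for tie-breaking); the function names are illustrative and not part of any Dynamo API.

```python
import random


def compute_logits(
    overlap_scores: dict[int, float],
    kv_usages: list[float],
    waitings: list[int],
) -> list[float]:
    """Score each worker: reward cache overlap, penalize cache usage and queue depth."""
    max_waiting = max(waitings) if waitings else 0
    logits = []
    for wid in range(len(kv_usages)):
        # Normalize queue depth by the busiest worker; 0.0 when nobody is waiting.
        waiting_norm = waitings[wid] / max_waiting if max_waiting else 0.0
        logits.append(2 * overlap_scores[wid] - kv_usages[wid] - waiting_norm)
    return logits


def select_best_worker(logits: list[float]) -> int:
    """Pick the highest-scoring worker, breaking ties uniformly at random."""
    best = max(logits)
    return random.choice([i for i, value in enumerate(logits) if value == best])


if __name__ == "__main__":
    block_size, num_tokens = 32, 128
    matched_blocks = {0: 2, 1: 0}  # worker 0 already caches two request blocks
    overlap = {w: m * block_size / num_tokens for w, m in matched_blocks.items()}
    logits = compute_logits(overlap, kv_usages=[0.2, 0.2], waitings=[1, 2])
    print(logits, select_best_worker(logits))
```

Doubling the overlap term means a fully cached request can outweigh a worker that is both full and the most heavily queued, since the two penalties are each capped at 1.0.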
--- a/examples/deployments/router_standalone_trtllm/test_router.py
+++ b/examples/deployments/router_standalone_trtllm/test_router.py
--- a/examples/deployments/router_standalone_trtllm/worker.py
+++ b/examples/deployments/router_standalone_trtllm/worker.py