Unverified commit bb43fada, authored by dagil-nvidia, committed by GitHub

chore: remove outdated router_standalone_trtllm example and add standalone router docs (#7278)


Signed-off-by: akshatha-k <akshutk@gmail.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: akshatha-k <akshutk@gmail.com>
parent bb07b2f4
...@@ -23,6 +23,10 @@ For Kubernetes, set `DYN_ROUTER_MODE=kv` on the Frontend service. Workers automa ...
| `--no-router-kv-events` | enabled | Fall back to approximate routing (no event consumption from workers) |
| `--router-queue-threshold` | disabled | Enable backpressure queue under high concurrency; also enables priority scheduling via `nvext.agent_hints.latency_sensitivity` |
### Standalone Router
You can also run the KV router as a standalone service (without the Dynamo frontend). See the [Standalone Router component](../../../components/src/dynamo/router/) for more details.
For all CLI arguments, environment variables, K8s deployment examples, and tuning guidelines, see the [Router Guide](router-guide.md). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
## Prerequisites and Limitations
......
...@@ -82,6 +82,10 @@ All CLI arguments can be configured via environment variables using the `DYN_` p ...
For complete K8s examples and advanced configuration, see [K8s Examples](router-examples.md#k8s-examples).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
### Standalone Router
You can also run the KV router as a standalone service (without the Dynamo frontend) for disaggregated serving (e.g., routing to prefill workers), multi-tier architectures, or any scenario requiring intelligent KV cache-aware routing decisions. See the [Standalone Router component](../../../components/src/dynamo/router/) for more details.
## KV Cache Routing
KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Router Standalone - TensorRT-LLM
A standalone implementation of KvRouter that demonstrates usage with TensorRT-LLM workers, without dependency on the dynamo runtime, etcd control plane, or nats event plane.
## Overview
This example shows how to use KvRouter with TensorRT-LLM workers to intelligently route requests across multiple GPUs based on KV cache overlap and load metrics. The router maintains a view of each worker's cached blocks and routes new requests to the worker with the best combination of cache overlap and available capacity.
Key features:
- **KV cache-aware routing**: Routes requests to workers with matching cached blocks
- **Multimodal support**: Handles vision-language models (e.g., Qwen2-VL) with image inputs
- **MM hash routing**: Identical images produce identical hashes for cache reuse
## How It Works
### Core Architecture
The router uses a **RadixTree** data structure (written in Rust) to efficiently track which blocks each worker has cached. When a new request arrives, the router:
1. Tokenizes the request and computes block hashes (including MM hashes for images)
2. Uses `find_matches` to calculate overlap scores between the request and each worker's cached blocks
3. Combines this with current load metrics to select the optimal worker
4. Routes the request to the chosen worker for processing
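The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the router's implementation: the real example delegates matching to the Rust `RadixTree` and its `find_matches` call, whereas here a flat list of hashes stands in for the tree. The helper names (`block_hashes`, `prefix_overlap`), the SHA-256 chaining scheme, and the tiny block size are all assumptions made for the sketch.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for illustration only


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    """Hash each full block of tokens, chaining in the previous block's digest.

    Chaining means a block hash identifies the whole prefix up to that block,
    which is what makes prefix matching meaningful. Trailing partial blocks
    are dropped.
    """
    hashes = []
    parent = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
        parent = digest
    return hashes


def prefix_overlap(request: list[int], cached: list[int]) -> int:
    """Count the leading request blocks that a worker already holds."""
    matched = 0
    for want, have in zip(request, cached):
        if want != have:
            break
        matched += 1
    return matched


# Two workers: worker 0 has served a prompt sharing an 8-token prefix
# with the new request, worker 1 has served something unrelated.
shared_prefix = [1, 2, 3, 4, 5, 6, 7, 8]
worker_caches = {
    0: block_hashes(shared_prefix + [9, 10, 11, 12]),
    1: block_hashes([42, 43, 44, 45]),
}

request_hashes = block_hashes(shared_prefix + [20, 21, 22, 23])
overlaps = {
    wid: prefix_overlap(request_hashes, cached)
    for wid, cached in worker_caches.items()
}
best_worker = max(overlaps, key=overlaps.get)
```

Here worker 0 matches the first two blocks of the request and worker 1 matches none, so an overlap-only policy would pick worker 0. The actual router also weighs load metrics before choosing.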
### Multimodal Routing
For vision-language models:
1. Images are processed using `default_multimodal_input_loader` from TensorRT-LLM
2. Image placeholders are expanded to visual tokens using HuggingFace `AutoProcessor`
3. `apply_mm_hashes` computes a content hash for each image
4. The MM hash is included in block hash computation, so identical images produce cache hits
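To see why the MM hash has to be part of the block hash, consider that every image expands to the same placeholder token ID, so token IDs alone cannot tell two images apart. The sketch below is an assumption-laden illustration: `mm_hash` and `block_hash_with_mm` are hypothetical helpers, SHA-256 stands in for whatever `apply_mm_hashes` computes, and the placeholder token ID is made up.

```python
import hashlib


def mm_hash(image_bytes: bytes) -> int:
    """Content hash of an image (hypothetical stand-in for apply_mm_hashes)."""
    return int.from_bytes(hashlib.sha256(image_bytes).digest()[:8], "big")


def block_hash_with_mm(tokens: list[int], mm_hashes: list[int]) -> int:
    """Hash a block of token IDs together with the hashes of the images it covers."""
    payload = repr((tokens, sorted(mm_hashes))).encode()
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


# Every image patch maps to the same placeholder token ID (value is made up).
visual_tokens = [151655] * 4

cat_first = mm_hash(b"cat image bytes")
cat_again = mm_hash(b"cat image bytes")  # same content, fetched a second time
dog = mm_hash(b"dog image bytes")

same_image_matches = (
    block_hash_with_mm(visual_tokens, [cat_first])
    == block_hash_with_mm(visual_tokens, [cat_again])
)
different_image_matches = (
    block_hash_with_mm(visual_tokens, [cat_first])
    == block_hash_with_mm(visual_tokens, [dog])
)
```

With the image hash mixed in, the repeated cat image yields the same block hash (a cache hit), while the dog image yields a different one even though the token IDs are identical.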
### Event-Driven Updates
The router receives two types of events from TensorRT-LLM engines:
1. **KV Events**: Emitted automatically when blocks are stored/removed from cache (includes `mm_keys` for multimodal)
2. **Load Metrics**: GPU cache usage and waiting request count
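The way cache overlap and load metrics combine can be shown with a small worked example. The formula below follows `_compute_logits` in this example's `router.py` (overlap weighted by 2, minus KV cache usage, minus the waiting count normalized by the busiest worker); the function name `routing_logits` and the sample numbers are illustrative assumptions.

```python
def routing_logits(
    overlaps: list[float], kv_usages: list[float], waitings: list[int]
) -> list[float]:
    """Score each worker: higher overlap is good, higher load is bad."""
    max_waiting = max(waitings) if waitings else 0
    logits = []
    for overlap, usage, waiting in zip(overlaps, kv_usages, waitings):
        # Normalize queue depth against the busiest worker (0.0 if all idle).
        waiting_norm = waiting / max_waiting if max_waiting else 0.0
        logits.append(2 * overlap - usage - waiting_norm)
    return logits


# Worker 0 has half the request cached but a long queue;
# worker 1 has nothing cached and a short queue.
logits = routing_logits(
    overlaps=[0.5, 0.0],
    kv_usages=[0.25, 0.25],
    waitings=[4, 1],
)
best_worker = max(range(len(logits)), key=logits.__getitem__)
```

Worker 0 scores `2 * 0.5 - 0.25 - 1.0 = -0.25` and worker 1 scores `0 - 0.25 - 0.25 = -0.5`, so the cache overlap outweighs the longer queue in this case. Note that `router.py` breaks ties randomly, whereas this sketch simply takes the first maximum.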
## Components
### `worker.py`
- **TrtllmWorkers**: Manages multiple TensorRT-LLM worker processes
- Each worker runs on a separate GPU with KV cache event emission enabled
- Publishes metrics and KV events over ZMQ
- Extracts `mm_hash` from TRTLLM's `mm_keys` field for multimodal routing
### `router.py`
- **KvRouter**: Core routing logic using RadixTree
- Subscribes to KV cache events and load metrics from workers
- Implements `get_best_worker()` to select optimal routing destination
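As a rough guide to the router's HTTP interface, the sketch below builds the JSON body for the `/find_best_worker` endpoint and decodes its reply. The field names (`local_hashes`, `num_tokens`, `worker_id`, `overlap`, `matched_blocks`) come from the request and response models in `router.py`; the helper functions and the sample values are hypothetical, and no network call is made.

```python
import json


def build_router_request(local_hashes: list[int], num_tokens: int) -> str:
    """Serialize the body expected by POST /find_best_worker."""
    if num_tokens <= 0:
        # The router rejects non-positive token counts with HTTP 400.
        raise ValueError("num_tokens must be positive")
    return json.dumps({"local_hashes": local_hashes, "num_tokens": num_tokens})


def parse_router_response(body: str) -> tuple[int, float, int]:
    """Decode the reply into (worker_id, overlap, matched_blocks)."""
    data = json.loads(body)
    return (
        data["worker_id"],
        data.get("overlap", 0.0),
        data.get("matched_blocks", 0),
    )


# Three 32-token blocks, so 96 tokens in total (hash values are made up).
payload = build_router_request([11, 22, 33], num_tokens=96)

# A reply of the shape the router returns (values are made up).
decision = parse_router_response(
    '{"worker_id": 1, "overlap": 0.5, "matched_blocks": 2}'
)
```

In a running setup the payload would be posted to the router port (7000 by default), for example with `httpx`, which is already among the required packages.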
### `api.py`
- **ServiceAPI**: FastAPI server providing OpenAI-compatible chat completions endpoint
- Handles multimodal inputs (images) via `default_multimodal_input_loader`
- Computes block hashes including MM hashes for routing decisions
- Streams responses in OpenAI format
### `test_router.py`
- Comprehensive test suite for router functionality
- Includes local hash computation tests and server-side multimodal tests
- Run with `--mm-only` for multimodal-specific tests
## Requirements
- **TensorRT-LLM >= 1.2.0rc6**: Earlier versions do not include multimodal information (`mm_keys`) in KV cache events, which MM hash-based routing requires. See [PR #9604](https://github.com/NVIDIA/TensorRT-LLM/pull/9604) for details.
- TensorRT-LLM with the PyTorch backend
- Multiple GPUs (one per worker)
- Python 3.10+
- Required packages: fastapi, uvicorn, httpx, pyzmq (imported as `zmq`), tensorrt_llm, transformers
## Usage
### 1. Start the API Server
```bash
python api.py \
--model Qwen/Qwen2-VL-2B-Instruct \
--num-workers 2 \
--block-size 32 \
--base-kv-events-port 5557 \
--base-metrics-port 5657 \
--router-port 7000 \
--http-port 8000
```
This will:
- Initialize TensorRT-LLM engines on each GPU
- Start ZMQ publishers for metrics and KV events
- Start the router service
- Start the OpenAI-compatible API server
### 2. Test with curl
**Text-only request:**
```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-VL-2B-Instruct",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 100,
"stream": false
}' | jq
```
**Multimodal request (with images):**
```bash
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-VL-2B-Instruct",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe both images in detail."},
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/Sayali9141/traffic_signal_images/resolve/main/61.jpg"}},
{"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test2017/000000000001.jpg"}}
]
}],
"max_tokens": 500,
"stream": false
}' | jq
```
### 3. Run Tests
```bash
# Run all tests
python test_router.py
# Run multimodal tests only
python test_router.py --mm-only
# Verbose output
python test_router.py -v
```
### 4. Check endpoint health
```bash
./ping.sh
```
## Configuration
### Command-line Arguments
- `--model`: HuggingFace model name (default: Qwen/Qwen2-VL-2B-Instruct)
- `--num-workers`: Number of GPU workers (default: 2)
- `--block-size`: KV cache block size (default: 32, TensorRT-LLM's default)
- `--base-kv-events-port`: Base port for KV events ZMQ (default: 5557)
- `--base-metrics-port`: Base port for metrics ZMQ (default: 5657)
- `--router-port`: Router HTTP service port (default: 7000)
- `--http-port`: API server port (default: 8000)
### Environment Variables
- `DYNAMO_DEBUG=1`: Enable debug file dumps to `/tmp/debug_*.txt`
- `LOGLEVEL=DEBUG`: Set logging level (DEBUG, INFO, WARNING, ERROR)
- `TRANSFORMERS_ATTN_IMPLEMENTATION=eager`: Disable FlashAttention (set automatically)
- `TRTLLM_MAX_NUM_TOKENS`: Set max token length
### Port Assignment
Workers use sequential ports:
- Worker 0: KV events on 5557, metrics on 5657
- Worker 1: KV events on 5558, metrics on 5658
- Worker N: KV events on 5557+N, metrics on 5657+N
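The port scheme can be expressed as a small helper. The `tcp://localhost:<port>` endpoint format matches what `router.py` connects to; the function names `worker_endpoints` and `ports_collide` are assumptions introduced for this sketch.

```python
def worker_endpoints(
    worker_id: int,
    base_kv_events_port: int = 5557,
    base_metrics_port: int = 5657,
) -> dict[str, str]:
    """Return the ZMQ endpoints for one worker under the sequential scheme."""
    return {
        "kv_events": f"tcp://localhost:{base_kv_events_port + worker_id}",
        "metrics": f"tcp://localhost:{base_metrics_port + worker_id}",
    }


def ports_collide(
    num_workers: int,
    base_kv_events_port: int = 5557,
    base_metrics_port: int = 5657,
) -> bool:
    """Check whether the KV event and metrics port ranges overlap."""
    kv_ports = {base_kv_events_port + i for i in range(num_workers)}
    metrics_ports = {base_metrics_port + i for i in range(num_workers)}
    return bool(kv_ports & metrics_ports)


worker_1 = worker_endpoints(1)
```

With the default bases 100 apart, the two ranges stay disjoint for up to 100 workers; beyond that, choose base ports that are further apart.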
## Architecture Diagram
```
  ┌─────────────┐
  │   Client    │
  └──────┬──────┘
         │ HTTP
┌─────────────────┐
│   API Server    │
│    (api.py)     │
└────────┬────────┘
         │ HTTP
┌─────────────────┐
│     Router      │──┐
│   (router.py)   │  │ ZMQ (KV Events)
└────────┬────────┘  │
         │           │
         │ Select    │
         │ Worker    │
         ▼           │
┌─────────────────┐  │
│  TrtllmWorkers  │  │
│   (worker.py)   │◄-┘
└─────────────────┘
     │       │
     ▼       ▼
   GPU 0   GPU 1
```
## Multimodal KV Cache Routing
When processing multimodal requests:
1. **API Layer** (`api.py`):
- Parses OpenAI-format messages with `image_url` content
- Uses `default_multimodal_input_loader` to process images
- Expands image placeholders to visual tokens via `AutoProcessor`
- Computes `mm_hash` using `apply_mm_hashes`
- Includes `mm_hash` in block hash computation for routing
2. **Worker Layer** (`worker.py`):
- Receives multimodal input and passes to TRTLLM
- Extracts `mm_hash` from TRTLLM's `mm_keys` in KV events
- Publishes KV events with `mm_extra_info` to router
3. **Router Layer** (`router.py`):
- RadixTree matches blocks including MM hash
- Same image content = same hash = cache hit on same worker
## Notes
- This is a standalone implementation for pedagogical purposes
- Production dynamo uses NATS for events and etcd for service discovery
- Each worker needs its own GPU
- TensorRT-LLM models may take time to compile on first run
## See Also
- [TensorRT-LLM KV Event Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/examples/llm_inference_kv_events.html)
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
This diff is collapsed.
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Simple health check - sends a basic chat request
# Model name should match what you started api.py with
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-VL-2B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 50
}' | jq
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import asyncio
import json
import logging
import os
from contextlib import asynccontextmanager

import numpy as np
import uvicorn
import zmq
import zmq.asyncio
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, ValidationError

from dynamo._core import RadixTree

logger = logging.getLogger(__name__)

DEBUG_ENABLED = os.environ.get("DYNAMO_DEBUG", "0") == "1"


def dump_kv_event(worker_id: int, event: dict):
    """Dump KV event to file for debugging (only when DYNAMO_DEBUG=1)."""
    if not DEBUG_ENABLED:
        return
    import datetime

    with open("/tmp/debug_kv_events.txt", "a") as f:
        f.write(f"\n{'='*60}\n")
        f.write(f"Timestamp: {datetime.datetime.now()}\n")
        f.write(f"Worker ID: {worker_id}\n")
        f.write(f"Event: {json.dumps(event, indent=2)}\n")

# -----------------------------------------------------------------------------
# Request/Response Models
# -----------------------------------------------------------------------------
class RouterRequest(BaseModel):
    local_hashes: list[int]
    num_tokens: int


class RouterResponse(BaseModel):
    worker_id: int
    overlap: float = 0.0
    matched_blocks: int = 0


class InjectEventRequest(BaseModel):
    """For testing: inject a KV event directly into RadixTree."""

    worker_id: int
    tokens_hash: int
    block_hash: int | None = None
    mm_extra_info: dict | None = None


class LoadMetrics(BaseModel):
    kv_cache_usage: float
    num_waiting_reqs: int

# -----------------------------------------------------------------------------
# ZMQ Helpers
# -----------------------------------------------------------------------------
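def create_zmq_subscriber(context: zmq.Context, endpoint: str) -> zmq.Socket[bytes]:
    """Create a ZMQ SUB socket with standard settings."""
    socket = context.socket(zmq.SUB)
    # Set options before connecting: ZMQ_CONFLATE in particular only applies
    # to connections made after it is set.
    socket.setsockopt(zmq.CONFLATE, 1)  # keep only the latest message
    socket.setsockopt(zmq.RCVTIMEO, 1)  # receive timeout in milliseconds
    socket.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to all topics
    socket.connect(endpoint)
    return socket
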
# -----------------------------------------------------------------------------
# KvRouter Core
# -----------------------------------------------------------------------------
class KvRouter:
    """Router that uses RadixTree for KV cache-aware worker selection."""

    def __init__(
        self,
        block_size: int = 64,
        num_workers: int = 4,
        base_kv_events_port: int = 5557,
        base_metrics_port: int = 5657,
    ):
        self.num_workers = num_workers
        self.block_size = block_size
        self.radix_tree = RadixTree()

        # Per-worker metrics
        self.kv_usages = [0.0] * num_workers
        self.waitings = [0] * num_workers

        # ZMQ setup
        self.context = zmq.Context()
        self.load_listeners = [
            create_zmq_subscriber(
                self.context, f"tcp://localhost:{base_metrics_port + i}"
            )
            for i in range(num_workers)
        ]
        self.async_context = zmq.asyncio.Context()
        self.kv_listeners = [
            self._create_kv_listener(base_kv_events_port + i)
            for i in range(num_workers)
        ]
        self.background_tasks: list[asyncio.Task] = []
        logger.info("Router initialized")

    def _create_kv_listener(self, port: int) -> zmq.asyncio.Socket:
        """Create an async ZMQ SUB socket for receiving KV cache events."""
        sock = self.async_context.socket(zmq.SUB)
        sock.connect(f"tcp://localhost:{port}")
        sock.setsockopt(zmq.SUBSCRIBE, b"")
        sock.setsockopt(zmq.RCVTIMEO, 1)
        return sock

    # -------------------------------------------------------------------------
    # Background Tasks
    # -------------------------------------------------------------------------

    async def start_background_tasks(self):
        """Start background tasks for load and tree updates."""
        logger.info("Starting router background tasks...")
        for worker_id in range(self.num_workers):
            self.background_tasks.append(
                asyncio.create_task(self._poll_worker_load(worker_id))
            )
            self.background_tasks.append(
                asyncio.create_task(self._poll_worker_kv_events(worker_id))
            )

    async def _poll_worker_load(self, worker_id: int):
        """Poll load metrics for a single worker."""
        while True:
            try:
                data = self.load_listeners[worker_id].recv_json(zmq.NOBLOCK)
                metrics = LoadMetrics.model_validate(data)
                self.kv_usages[worker_id] = metrics.kv_cache_usage
                self.waitings[worker_id] = metrics.num_waiting_reqs
            except zmq.Again:
                pass
            except (zmq.ZMQError, ValidationError) as e:
                logger.warning(f"Worker {worker_id} metrics error: {e}")
            except Exception:
                logger.exception(f"Worker {worker_id} unexpected metrics error")
            await asyncio.sleep(0.1)

    async def _poll_worker_kv_events(self, worker_id: int):
        """Poll KV events for a single worker and update RadixTree."""
        sock = self.kv_listeners[worker_id]
        while True:
            try:
                event_bytes = await sock.recv(zmq.NOBLOCK)
                event = json.loads(event_bytes)
                dump_kv_event(worker_id, event)
                self.radix_tree.apply_event(
                    worker_id, json.dumps(event).encode("utf-8")
                )
            except zmq.Again:
                pass
            except (zmq.ZMQError, json.JSONDecodeError) as e:
                logger.warning(f"Worker {worker_id} KV events error: {e}")
            except Exception:
                logger.exception(f"Worker {worker_id} unexpected KV events error")
            await asyncio.sleep(0.1)

    # -------------------------------------------------------------------------
    # Worker Selection
    # -------------------------------------------------------------------------

    async def get_best_worker(
        self, local_hashes: list[int], num_tokens: int
    ) -> tuple[int, float, int]:
        """
        Find best worker for request.

        Returns: (worker_id, overlap_ratio, matched_blocks)
        """
        if num_tokens <= 0:
            raise ValueError("num_tokens must be positive")

        # Get cache matches from RadixTree
        matched_blocks = self._get_matched_blocks(local_hashes)

        # Compute overlap scores
        overlap_scores = {
            wid: matched_blocks[wid] * self.block_size / num_tokens
            for wid in range(self.num_workers)
        }

        # Compute routing logits
        logits = self._compute_logits(overlap_scores)

        # Select best worker (random tie-breaking)
        best_id = self._select_best_worker(logits)

        # Predictive update for burst handling
        self.waitings[best_id] += 1
        return best_id, overlap_scores[best_id], matched_blocks[best_id]

    def _get_matched_blocks(self, local_hashes: list[int]) -> dict[int, int]:
        """Get matched block count per worker from RadixTree."""
        result = self.radix_tree.find_matches(local_hashes)
        raw_scores = result.scores
        logger.info(f"Router: raw_scores={raw_scores}")
        # raw_scores is keyed by (worker_id, dp_rank); assume dp_rank=0
        return {wid: raw_scores.get((wid, 0), 0) for wid in range(self.num_workers)}

    def _compute_logits(self, overlap_scores: dict[int, float]) -> list[float]:
        """Compute routing logits for each worker."""
        max_waiting = max(self.waitings) if self.waitings else 0
        logits = []
        for wid in range(self.num_workers):
            overlap = overlap_scores[wid]
            usage = self.kv_usages[wid]
            waiting_norm = self.waitings[wid] / max_waiting if max_waiting else 0.0
            logit = 2 * overlap - usage - waiting_norm
            logits.append(logit)
            logger.info(
                f"worker_id: {wid}, logit = 2 * {overlap:.3f} - {usage:.3f} - {waiting_norm:.3f} = {logit:.3f}"
            )
        return logits

    def _select_best_worker(self, logits: list[float]) -> int:
        """Select worker with highest logit (random tie-breaking)."""
        arr = np.array(logits)
        return int(np.random.choice(np.flatnonzero(arr == arr.max())))

    # -------------------------------------------------------------------------
    # Shutdown
    # -------------------------------------------------------------------------

    async def shutdown(self):
        """Shutdown ZMQ listeners and background tasks."""
        logger.info("Shutting down KvRouter...")
        for task in self.background_tasks:
            task.cancel()
        if self.background_tasks:
            await asyncio.gather(*self.background_tasks, return_exceptions=True)
        for listener in self.load_listeners:
            listener.close()
        for listener in self.kv_listeners:
            listener.close()
        self.context.term()
        self.async_context.term()
        logger.info("KvRouter shutdown completed")

# -----------------------------------------------------------------------------
# Router API Server
# -----------------------------------------------------------------------------
class RouterAPI:
    """FastAPI wrapper for KvRouter."""

    def __init__(
        self,
        block_size: int = 64,
        num_workers: int = 4,
        base_kv_events_port: int = 5557,
        base_metrics_port: int = 5657,
        port: int = 7000,
    ):
        self.port = port
        self.router_config = {
            "block_size": block_size,
            "num_workers": num_workers,
            "base_kv_events_port": base_kv_events_port,
            "base_metrics_port": base_metrics_port,
        }
        self.router: KvRouter | None = None
        self.app = FastAPI(
            title="KV Router API", version="0.0.1", lifespan=self.lifespan
        )
        self._setup_routes()

    def _require_router(self) -> KvRouter:
        """Get router or raise 503 if not initialized."""
        if self.router is None:
            raise HTTPException(status_code=503, detail="Router not initialized")
        return self.router

    @asynccontextmanager
    async def lifespan(self, app: FastAPI):
        self.router = KvRouter(**self.router_config)
        await self.router.start_background_tasks()
        logger.info("Router API started")
        yield
        if self.router:
            await self.router.shutdown()

    def _setup_routes(self):
        @self.app.post("/find_best_worker", response_model=RouterResponse)
        async def find_best_worker(request: RouterRequest):
            router = self._require_router()
            try:
                wid, overlap, matched = await router.get_best_worker(
                    request.local_hashes, request.num_tokens
                )
                return RouterResponse(
                    worker_id=wid, overlap=overlap, matched_blocks=matched
                )
            except ValueError as e:
                raise HTTPException(status_code=400, detail=str(e)) from e

        @self.app.get("/debug/tree_info")
        async def get_tree_info():
            router = self._require_router()
            events = router.radix_tree.dump_tree_as_events()
            return {"num_blocks": len(events), "events": events[:20]}

        @self.app.post("/debug/inject_event")
        async def inject_event(request: InjectEventRequest):
            router = self._require_router()
            # Compare against None explicitly so a block_hash of 0 is honored
            # instead of silently falling back to tokens_hash.
            block_hash = (
                request.block_hash
                if request.block_hash is not None
                else request.tokens_hash
            )
            event = {
                "event_id": 99999,
                "data": {
                    "stored": {
                        "parent_hash": None,
                        "blocks": [
                            {
                                "block_hash": block_hash,
                                "tokens_hash": request.tokens_hash,
                                "mm_extra_info": request.mm_extra_info,
                            }
                        ],
                    }
                },
            }
            router.radix_tree.apply_event(
                request.worker_id, json.dumps(event).encode("utf-8")
            )
            return {
                "status": "ok",
                "tokens_hash": request.tokens_hash,
                "worker_id": request.worker_id,
            }

    async def start(self):
        """Start the router API server."""
        logger.info(f"Starting Router API on port {self.port}")
        config = uvicorn.Config(
            self.app, host="0.0.0.0", port=self.port, log_level="info"
        )
        await uvicorn.Server(config).serve()

def main():
    parser = argparse.ArgumentParser(description="KV Router API Server")
    parser.add_argument(
        "--block-size", type=int, default=32, help="Block size (default: 32)"
    )
    parser.add_argument("--num-workers", type=int, default=2, help="Number of workers")
    parser.add_argument(
        "--base-kv-events-port", type=int, default=5557, help="Base KV events port"
    )
    parser.add_argument(
        "--base-metrics-port", type=int, default=5657, help="Base metrics port"
    )
    parser.add_argument("--port", type=int, default=7000, help="Router API port")
    args = parser.parse_args()

    logging.basicConfig(level=logging.INFO)
    api = RouterAPI(
        block_size=args.block_size,
        num_workers=args.num_workers,
        base_kv_events_port=args.base_kv_events_port,
        base_metrics_port=args.base_metrics_port,
        port=args.port,
    )
    asyncio.run(api.start())


if __name__ == "__main__":
    main()