Commit 096d117d authored by Yan Ru Pei, committed by GitHub

docs: update router docs (#2148)

parent 708d7c3f
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# KV Router
## Overview
Dynamo's KV Router makes intelligent routing decisions by evaluating the computational cost of processing requests on different workers. The router considers both the decoding cost (active blocks) and prefill cost (new blocks that need to be computed). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in a distributed inference setup.
## Quick Start
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8080
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8080 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers can register themselves using the `register_llm` API, and the KV Router will automatically include them in its routing decisions. The router will:
- Track the state of all registered workers
- Make intelligent routing decisions based on KV cache overlap
- Balance load across available workers
### Important Arguments
The KV Router supports several key configuration options:
- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
  - `0.0`: Deterministic selection of the best worker
  - `> 0.0`: Probabilistic selection using softmax sampling
  - Higher values increase randomness, helping prevent worker saturation
- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
For a complete list of available options:
```bash
python -m dynamo.frontend --help
```
## KV Router Architecture
The KV Router tracks two key metrics for each worker:
1. **Potential Active Blocks**: The total number of blocks that would be actively used for decoding if a request were routed to that worker. This includes existing active blocks plus new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The number of new blocks that would need to be prefilled (computed from scratch) on that worker, calculated as:
   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
   - Potential prefill blocks = New prefill tokens / Block size
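The two metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than the router's actual code: the function names are hypothetical, and rounding the incoming request up to whole blocks for the active-block count is an assumption.

```python
import math


def potential_prefill_blocks(total_input_tokens: int, overlap_blocks: int, block_size: int) -> float:
    """Blocks that would have to be computed from scratch on a worker."""
    # New prefill tokens = total input tokens - (overlap blocks x block size)
    new_prefill_tokens = max(total_input_tokens - overlap_blocks * block_size, 0)
    # Potential prefill blocks = new prefill tokens / block size (may be fractional)
    return new_prefill_tokens / block_size


def potential_active_blocks(existing_active_blocks: int, total_input_tokens: int, block_size: int) -> int:
    """Existing active blocks plus the blocks the incoming request would occupy."""
    # Assumption: a partially filled block still occupies a whole block.
    return existing_active_blocks + math.ceil(total_input_tokens / block_size)


# A 1000-token request, 10 cached blocks of 64 tokens, 25 blocks already active
prefill = potential_prefill_blocks(1000, overlap_blocks=10, block_size=64)
active = potential_active_blocks(25, 1000, block_size=64)
```

Here 640 of the 1000 tokens are covered by cached blocks, so only 360 tokens (5.625 blocks) count toward the prefill cost.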
### Block Tracking Mechanisms
The router maintains block information through two complementary systems:
- **Active Decoding Blocks**: Tracked locally by the router based on the request lifecycle:
  - Incremented when a new request is added
  - Updated as new tokens are generated
  - Decremented when a request completes
- **Cached Blocks**: Maintained globally by the KvIndexer, which builds a prefix tree from KV events reported by workers. This provides accurate overlap information for routing decisions.
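The request lifecycle described for active decoding blocks could be tracked along the following lines. This is a hedged sketch: `ActiveBlockTracker` and its method names are hypothetical and do not mirror the router's internal types.

```python
class ActiveBlockTracker:
    """Router-local bookkeeping of active decoding blocks per worker (illustrative only)."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        # request_id -> (worker_id, number of tokens currently held by the request)
        self._requests = {}

    def add_request(self, request_id: str, worker_id: str, num_input_tokens: int) -> None:
        # Incremented when a new request is added
        self._requests[request_id] = (worker_id, num_input_tokens)

    def add_generated_token(self, request_id: str) -> None:
        # Updated as new tokens are generated
        worker_id, tokens = self._requests[request_id]
        self._requests[request_id] = (worker_id, tokens + 1)

    def complete_request(self, request_id: str) -> None:
        # Decremented when a request completes
        self._requests.pop(request_id, None)

    def active_blocks(self, worker_id: str) -> int:
        # Ceiling division: a partially filled block still counts as one block
        return sum(
            -(-tokens // self.block_size)
            for worker, tokens in self._requests.values()
            if worker == worker_id
        )
```

The cached-block side is deliberately left out of this sketch, since it depends on the KV events that workers report to the KvIndexer.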
## Cost Function
The KV Router's routing decision is based on a simple cost function:
```
logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
```
Where:
- Lower logit values are better (less computational cost)
- The router uses softmax sampling with optional temperature to select workers
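As a rough illustration of the cost function and the sampling step, the sketch below computes one logit per worker and then selects a worker. Because lower logits are better, the softmax is taken over the negated logits. The function names, and the absence of any logit normalization before the softmax, are assumptions rather than a description of the router's implementation.

```python
import math
import random


def worker_logit(prefill_blocks: float, active_blocks: float, kv_overlap_score_weight: float = 1.0) -> float:
    # logit = kv_overlap_score_weight x potential_prefill_blocks + potential_active_blocks
    return kv_overlap_score_weight * prefill_blocks + active_blocks


def select_worker(logits: dict, temperature: float = 0.0, rng=random) -> str:
    """Pick a worker; lower logit means lower cost."""
    if temperature <= 0.0:
        # Deterministic: the cheapest worker wins
        return min(logits, key=logits.get)
    # Softmax over negated logits; subtracting the minimum keeps exp() stable
    best = min(logits.values())
    workers = list(logits)
    weights = [math.exp(-(logits[w] - best) / temperature) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


logits = {
    "worker_a": worker_logit(prefill_blocks=100.5, active_blocks=25.0),
    "worker_b": worker_logit(prefill_blocks=10.0, active_blocks=40.0),
}
chosen = select_worker(logits)  # temperature 0.0 selects the lowest-cost worker
```

With a temperature above zero, cheaper workers remain more likely, but every worker has a non-zero chance of being selected.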
### Key Parameter: kv-overlap-score-weight
The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
- **Higher values (> 1.0)**: Emphasize reducing prefill cost
  - Prioritizes routing to workers with better cache hits
  - Optimizes for Time To First Token (TTFT)
  - Best for workloads where initial response latency is critical
- **Lower values (< 1.0)**: Emphasize decode performance
  - Distributes active decoding blocks more evenly
  - Optimizes for Inter-Token Latency (ITL)
  - Best for workloads with long generation sequences
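A small worked example may make the trade-off concrete. The worker names and block counts below are invented for illustration; only the cost formula comes from this document.

```python
def cheapest_worker(workers: dict, kv_overlap_score_weight: float) -> str:
    """workers maps name -> (potential_prefill_blocks, potential_active_blocks)."""
    costs = {
        name: kv_overlap_score_weight * prefill + active
        for name, (prefill, active) in workers.items()
    }
    return min(costs, key=costs.get)


# Hypothetical state: "cache_rich" has a good prefix hit but is busy decoding,
# "idle" has little cached but few active blocks.
workers = {"cache_rich": (2.0, 30.0), "idle": (12.0, 10.0)}

low_weight_choice = cheapest_worker(workers, kv_overlap_score_weight=0.5)   # 31.0 vs 16.0
high_weight_choice = cheapest_worker(workers, kv_overlap_score_weight=3.0)  # 36.0 vs 46.0
```

With a weight of 0.5 the lightly loaded worker wins, which favors ITL. With a weight of 3.0 the worker holding the cached prefix wins, which favors TTFT.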
## KV Events vs. Approximation Mode
By default, the router uses KV events from workers to maintain an accurate global view of cached blocks. However, you can disable this with the `--no-kv-events` flag:
- **With KV Events (default)**:
  - Accurate overlap calculation based on actual cached blocks
  - Higher accuracy but requires event processing overhead
  - Best for production deployments
- **Without KV Events (`--no-kv-events`)**:
  - Uses the ApproxKvIndexer to approximate cached blocks based on routing decisions
  - Assumes that recently routed requests will have their blocks cached
  - Lower overhead but potentially less accurate routing
  - Useful for testing or environments where event processing is a bottleneck
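The idea behind approximation mode can be sketched as follows. This is not the ApproxKvIndexer itself: the class name, the per-block hashes, and the time-to-live used to define "recently routed" are all assumptions made for illustration.

```python
import time


class ApproxIndexSketch:
    """Guess cached blocks from past routing decisions (illustrative only)."""

    def __init__(self, ttl_seconds: float = 120.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumed expiry window, not a documented default
        self._clock = clock
        self._last_routed = {}  # (worker_id, block_hash) -> time the block was last routed

    def record_routing(self, worker_id: str, block_hashes: list) -> None:
        # Assume every block of a routed request ends up cached on that worker
        now = self._clock()
        for block_hash in block_hashes:
            self._last_routed[(worker_id, block_hash)] = now

    def overlap_blocks(self, worker_id: str, block_hashes: list) -> int:
        # Count the leading blocks that were routed to this worker recently
        now = self._clock()
        count = 0
        for block_hash in block_hashes:
            routed_at = self._last_routed.get((worker_id, block_hash))
            if routed_at is None or now - routed_at > self.ttl_seconds:
                break
            count += 1
        return count
```

No worker feedback is involved, which is why the overhead is lower and why the estimate can drift from what workers actually hold, for example after an eviction.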
## Tuning Guidelines
### 1. Understand Your Workload Characteristics
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
### 2. Monitor Key Metrics
The router logs the cost calculation for each worker:
```
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```
This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
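For offline analysis, log lines of this shape can be parsed with a regular expression. The sketch below assumes the exact layout shown above, and the sample line is made up for illustration; adjust the pattern if your router version logs a different format.

```python
import re

# Assumed layout: "Formula for <worker>: <total> = <weight> * <prefill> + <active> (cached_blocks: <n>)"
COST_LINE = re.compile(
    r"Formula for (?P<worker>\S+): (?P<total>[\d.]+) = (?P<weight>[\d.]+) \* "
    r"(?P<prefill>[\d.]+) \+ (?P<active>[\d.]+) \(cached_blocks: (?P<cached>\d+)\)"
)


def parse_cost_line(line: str):
    """Return the cost components of one log line, or None if it does not match."""
    match = COST_LINE.search(line)
    if match is None:
        return None
    return {
        "worker": match["worker"],
        "total": float(match["total"]),
        "weight": float(match["weight"]),
        "prefill_blocks": float(match["prefill"]),
        "active_blocks": float(match["active"]),
        "cached_blocks": int(match["cached"]),
    }


# Illustrative line, not taken from a real deployment
sample = "Formula for worker_2: 60.0 = 2.0 * 20.0 + 20.0 (cached_blocks: 8)"
parsed = parse_cost_line(sample)
```

Collecting these records per worker over time makes it easier to see whether the prefill term or the active-block term dominates routing decisions.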
### 3. Temperature-Based Routing
The `--router-temperature` flag controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection; higher values increase randomness
  - Useful for preventing worker saturation and improving load distribution
### 4. Iterative Optimization
1. Start with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` based on your optimization goals:
   - If TTFT is too high: Increase the weight
   - If ITL is too high: Decrease the weight
4. Increase temperature if severe load imbalance occurs
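One way to make step 3 repeatable is a small helper that proposes the next weight from measured latencies. Everything here is hypothetical: the function name, the latency targets, and the multiplicative step of 1.25 are illustrative choices, not recommendations from the router.

```python
def suggest_weight(current_weight: float, ttft_ms: float, itl_ms: float,
                   ttft_target_ms: float, itl_target_ms: float, step: float = 1.25) -> float:
    """Propose the next kv-overlap-score-weight from observed TTFT and ITL."""
    ttft_too_high = ttft_ms > ttft_target_ms
    itl_too_high = itl_ms > itl_target_ms
    if ttft_too_high and not itl_too_high:
        return current_weight * step  # favor cache hits to cut prefill time
    if itl_too_high and not ttft_too_high:
        return current_weight / step  # favor even decode load
    # Both on target, or both off target: the weight alone cannot resolve it
    return current_weight


next_weight = suggest_weight(1.0, ttft_ms=900, itl_ms=20, ttft_target_ms=500, itl_target_ms=30)
```

When both metrics miss their targets, the helper leaves the weight unchanged, since that situation usually points to a capacity or load-balance problem rather than a weighting problem.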
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
>[!NOTE]
>This information is temporary and will change soon.
# KV Router Performance Tuning
## Overview
Dynamo's KV Router listens to KV events from worker nodes to build a global prefix tree of KV caches. This enables the router to predict the KV hit rate per worker (overlap score) for incoming requests and make intelligent routing decisions. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in a distributed inference setup.
## KV Router Architecture
The KV Router maintains a global view of all KV caches across workers. When a new request arrives:
1. The router calculates an **overlap score** for each worker by finding matching blocks between the request and the prefix cache
2. It collects runtime metrics from each worker including **KV cache usage** and **waiting request count**
3. It applies a cost function to determine the optimal worker for the request
More details can be found in `docs/kv_cache_routing.md`.
## Cost Function Tuning
The KV Router's decision-making is primarily controlled by its cost function, which can be customized in `kv_router.py`. The default cost function is:
```python
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
```
Where:
- `score`: Normalized overlap score (matching blocks × block_size / token_length)
- `gpu_cache_usage`: Percentage of GPU KV cache currently in use
- `normalized_waiting`: Number of waiting requests normalized by the max waiting requests across all workers
The router selects the worker with the highest logit value. In the event of a tie, it randomly chooses among the top-scoring workers.
Alternatively, applying a softmax to the logits and sampling from the resulting probabilities introduces stochasticity into the routing process.
This probabilistic approach helps prevent a failure mode in which one worker receives a disproportionate number of requests and saturates its prefix cache.
Such saturation can create a feedback loop in which the cache-rich worker keeps being selected, a cycle that is difficult to break deterministically.
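The default greedy selection with random tie-breaking can be sketched as below. The helper names are hypothetical, and treating `gpu_cache_usage` as a fraction between 0 and 1 is an assumption; only the logit formula is taken from the text above.

```python
import random


def normalize_waiting(waiting: dict) -> dict:
    """Scale each worker's waiting-request count by the maximum across workers."""
    max_waiting = max(waiting.values(), default=0)
    return {w: (n / max_waiting if max_waiting > 0 else 0.0) for w, n in waiting.items()}


def worker_logit(score: float, gpu_cache_usage: float, normalized_waiting: float) -> float:
    # Default cost function: higher is better
    return 2 * score - gpu_cache_usage - normalized_waiting


def pick_worker(logits: dict, rng=random) -> str:
    """Greedy selection; ties are broken uniformly at random."""
    best = max(logits.values())
    top = [worker for worker, logit in logits.items() if logit == best]
    return rng.choice(top)


waiting = normalize_waiting({"a": 0, "b": 4})
logits = {
    "a": worker_logit(score=0.5, gpu_cache_usage=0.2, normalized_waiting=waiting["a"]),
    "b": worker_logit(score=0.9, gpu_cache_usage=0.5, normalized_waiting=waiting["b"]),
}
selected = pick_worker(logits)
```

Worker `b` has the higher overlap score, yet its queue and cache pressure push its logit below that of worker `a`, so `a` is selected.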
### Key Tuning Parameters
1. **Overlap Score Weight** (default: 2.0)
   - Higher values prioritize KV cache reuse
   - Lower values allow more even distribution of requests
2. **GPU Cache Usage Weight** (default: 1.0)
   - Higher values avoid workers with nearly full KV caches
   - Lower values ignore KV cache utilization
3. **Waiting Requests Weight** (default: 1.0)
   - Higher values avoid workers with queued requests
   - Lower values ignore queue lengths
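Exposing the three weights explicitly might look like the following. The `RouterWeights` dataclass is a hypothetical convenience for this sketch; in the source, the cost function is customized by editing `kv_router.py` directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterWeights:
    overlap: float = 2.0          # Overlap Score Weight
    gpu_cache_usage: float = 1.0  # GPU Cache Usage Weight
    waiting: float = 1.0          # Waiting Requests Weight


def weighted_logit(score: float, gpu_cache_usage: float, normalized_waiting: float,
                   weights: RouterWeights = RouterWeights()) -> float:
    """Generalized form of the default cost function; higher is better."""
    return (
        weights.overlap * score
        - weights.gpu_cache_usage * gpu_cache_usage
        - weights.waiting * normalized_waiting
    )


default_logit = weighted_logit(0.5, 0.2, 0.0)
# Ignore cache utilization entirely, e.g. when KV blocks are plentiful
no_cache_penalty = weighted_logit(0.5, 0.2, 0.0, RouterWeights(gpu_cache_usage=0.0))
```

With the default weights this reduces to `2 * score - gpu_cache_usage - normalized_waiting`.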
## Tuning Guidelines
Currently, optimal use of our KV router requires understanding your backend engine's capacity and the prefix structure of your data. We provide analysis tools for this purpose in the `benchmarks` directory. In the future, we plan to enable automatic tuning of our KV router (via `Planner`) using worker feedback metrics and dynamic analysis of data prefix structures (WIP). Below are several tips we recommend following.
### 1. Consider Total KV Block Allocation
Check the total number of KV blocks allocated for your backend engine. For smaller models (e.g., 8B parameters), this can exceed one million blocks. In such cases:
- Reduce the weight on KV cache usage (`gpu_cache_usage`) since exhausting KV cache is less likely
- Focus more on overlap score and waiting requests
### 2. Analyze Your Dataset's Theoretical Hit Rate
Consider the expected theoretical hit rate of your dataset (assuming perfect caching):
- More formally, consider the depth of your core prefix tree (nodes visited at least twice)
- For lower hit rates, or if the core prefix tree depth is short compared to the ISL,
  reduce the overlap score weight
- Alternatively, normalize the overlap score with the input sequence length (ISL)
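The theoretical hit rate can be estimated offline by replaying a dataset against an unbounded cache. The sketch below is an assumption-laden stand-in for the analysis tools in the `benchmarks` directory: it keys each block by its full token prefix and ignores eviction, capacity, and trailing partial blocks.

```python
def theoretical_hit_rate(requests: list, block_size: int) -> float:
    """Fraction of full blocks whose entire prefix appeared in an earlier request."""
    seen_prefixes = set()
    hits = 0
    total = 0
    for tokens in requests:
        for i in range(len(tokens) // block_size):
            # A block is reusable only if everything up to and including it matches
            prefix = tuple(tokens[: (i + 1) * block_size])
            if prefix in seen_prefixes:
                hits += 1
            seen_prefixes.add(prefix)
            total += 1
    return hits / total if total else 0.0


# Three toy requests with a block size of 2 tokens
requests = [[1, 2, 3, 4], [1, 2, 5, 6], [1, 2, 3, 4]]
hit_rate = theoretical_hit_rate(requests, block_size=2)
```

Storing whole prefixes as tuples is quadratic in sequence length; for real datasets, a rolling hash per block would be a more practical key.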
### 3. Consider Prefix Tree Breadth
The breadth of your prefix tree can be proxied by how many unique context prompts you expect:
- If you can identify distinct "buckets" of similar prompts, each containing a roughly equal number of prompts,
  consider using multiple KV routers, one for each bucket
- Use a meta-router to direct requests to specialized KV routers for different prompt categories
- For very diverse context prompts, overlap scores should probably be prioritized
### 4. Balance Latency vs. Throughput
The weights directly impact your service level objectives:
- Higher weights on waiting requests improve latency but may reduce throughput
- Higher weights on overlap score may improve throughput but could increase tail latency
## Alternative Routing Strategies
The default strategy uses greedy selection (highest logit wins), but other approaches can be implemented:
- **Softmax Sampling**: Converts logits to probabilities and samples workers probabilistically
- **Temperature-Based Sampling**: Adds a temperature parameter to control sampling randomness
- **Two-Stage Routing**: For example, using round-robin as a meta-router that distributes requests across multiple KV routers
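Two-stage routing with a round-robin first stage can be sketched as follows. Each second-stage router is modeled as a plain callable that maps a request to a worker ID; this interface and the class name are assumptions for illustration, not part of Dynamo's API.

```python
from itertools import cycle


class RoundRobinMetaRouter:
    """First stage: rotate across KV routers. Second stage: the chosen router picks a worker."""

    def __init__(self, kv_routers: list):
        if not kv_routers:
            raise ValueError("at least one KV router is required")
        self._next_router = cycle(kv_routers)

    def route(self, request):
        kv_router = next(self._next_router)
        return kv_router(request)


# Stand-in second-stage routers; real ones would apply the cost function per worker pool
meta = RoundRobinMetaRouter([
    lambda request: "pool0-worker",
    lambda request: "pool1-worker",
])
assignments = [meta.route(f"req-{i}") for i in range(4)]
```

Replacing the round-robin stage with a classifier over prompt categories would give the bucketed meta-router variant described in the tuning guidelines.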
## Monitoring and Refinement
To effectively tune your KV Router:
1. Monitor the router logs to see actual logit calculations for each worker
2. Track hit rates, latency, and throughput metrics
3. Iteratively adjust weights based on observed performance
4. Consider dynamically adjusting weights based on current load conditions