Unverified Commit f7e468c7 authored by Yan Ru Pei's avatar Yan Ru Pei Committed by GitHub
Browse files

docs: fix lowest cost mistake in doc (#2361)


Signed-off-by: default avatarYan Ru Pei <yanrpei@gmail.com>
parent 5166a3dd
......@@ -4,26 +4,26 @@ SPDX-License-Identifier: Apache-2.0
-->
# KV Cache Routing
This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
To enable KV cache aware routing start the frontend node like this:
```
python -m dynamo.frontend --router-mode kv
```
The engine announces when a KV block is created or removed. The Dynamo router run finds the worker with the best match for those KV blocks and directs the traffic to that node.
When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
The KV-aware routing arguments:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks). Defaults to 1.
- `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 (default) recovers the deterministic behavior where the min logit is picked.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--use-kv-events`/`--no-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true (default), then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
- `--use-kv-events`/`--no-kv-events`: Determines how the router tracks cached blocks. When enabled (default), uses `KvIndexer` to monitor block creation and deletion events. When disabled, uses `ApproxKvIndexer`, which estimates cache hits based on a fixed time window (120s). Disable this if your backend doesn't support KV events.
- `--router-replica-sync`: Enables state synchronization between multiple router replicas via NATS. Disabled by default, and can be enabled by passing the flag in. When enabled, router replicas share their view of KV cache distribution and active sequences, allowing all routers to make optimal routing decisions even when requests are distributed across multiple router instances. This improves fault tolerance and routing accuracy in multi-router deployments.
- `--router-replica-sync`: Enables NATS-based state synchronization between router replicas. When enabled, routers share their KV cache distribution and active sequence information, ensuring optimal routing decisions across multiple router instances. This improves fault tolerance and routing accuracy in distributed deployments. Disabled by default.
## Architecture
......@@ -107,29 +107,29 @@ Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/in
| | |
| Cached: 2 blocks | Cached: 5 blocks | Cached: 8 blocks
| Prefill: 8 blks | Prefill: 5 blks | Prefill: 2 blks
| Decode: 10 blks | Decode: 7 blks | Decode: 9 blks
| Decode: 10 blks | Decode: 5 blks | Decode: 9 blks
v v v
+----------------+ +----------------+ +----------------+
| Worker 1 | | Worker 2 | | Worker 3 |
+----------------+ +----------------+ +----------------+
```
Load balancing in LLM serving becomes complex when enabling KV Cache reuse. While KV Cache reuse can save significant computation, if the routing strategy is not aware of the unique KV states of each worker we can:
- miss opportunities for KV Cache reuse if routing to the "wrong" node
- get into an imbalanced state where a few workers are processing many requests, lowering throughput of entire system
KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
### Cost Calculation
1. **Prefill blocks**: The number of tokens that need to be processed during prefill is predicted based on the request's input tokens and the cached blocks available on each worker. This is divided by the block size to get the effective "prefill blocks". This prediction is updated when the first output token is produced, signaling prefill completion.
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: The number of blocks needed during the decode phase is predicted based on the request's input tokens and the current active sequences on each worker. This is updated when the request is freed (blocks are dereferenced or freed).
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower cost is better
- The `overlap_score_weight` parameter controls the importance of cache hits vs. load balancing
- A higher weight prioritizes cache reuse (better TTFT) while a lower weight prioritizes load distribution (better ITL)
- Lower costs indicate better routing choices
- `overlap_score_weight` balances cache hit optimization against load distribution
- Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
### Worker Selection
......@@ -137,14 +137,12 @@ The router selects the worker with the lowest cost. When `router_temperature` is
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 7 = 12** (selected - lowest cost)
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
## Events
In Dynamo, we support KV Cache Routing for many backends that have different implementations of KV Cache. To enable this, we built a KVPublisher that can be plugged into any framework to publish KV Events.
On the receiving side we have a KVIndexer which accepts events from the KVPublisher and puts them into a global prefix tree for tracking cached blocks across all workers.
Dynamo supports KV Cache Routing across multiple backend implementations through a flexible event system. The KVPublisher component integrates with any framework to emit KV events, while the KVIndexer component maintains a global prefix tree of cached blocks by processing these events from all workers.
```text
+----------------+ +-----------------+
......@@ -174,13 +172,13 @@ The KVIndexer has a method `find_matches_for_request`, which takes in tokens and
### Inter-Router Communication
In multi-router deployments, each router only observes a subset of requests. To maintain a consistent global view of active sequences and KV cache states, routers broadcast their local actions to other replicas through three synchronization events:
In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
1. **AddRequest**: Published when assigning a request to a worker, containing the request ID, worker ID, token sequence blocks, and overlap score. This updates other routers' tracking of which blocks are in use.
1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
2. **MarkPrefillCompleted**: Published when a request transitions from prefill to decode phase, signaling that prefill tokens should no longer count toward the worker's active prefill load.
2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
3. **Free**: Published when a request completes and its resources are released, allowing other routers to update their block reference counts.
3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
Each event includes a unique router ID to prevent processing of self-generated events. This asynchronous communication ensures all routers maintain synchronized KV cache state for optimal routing decisions despite handling different request streams.
Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
......@@ -7,7 +7,7 @@ SPDX-License-Identifier: Apache-2.0
## Overview
Dynamo's KV Router makes intelligent routing decisions by evaluating the computational cost of processing requests on different workers. The router considers both the decoding cost (active blocks) and prefill cost (new blocks that need to be computed). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in a distributed inference setup.
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start
......@@ -22,10 +22,10 @@ This command:
- Exposes the service on port 8080 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers can register themselves using the `register_llm` API, and the KV Router will automatically include them in its routing decisions. The router will:
- Track the state of all registered workers
- Make intelligent routing decisions based on KV cache overlap
- Balance load across available workers
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
- Tracks the state of all registered workers
- Makes routing decisions based on KV cache overlap
- Balances load across available workers
### Important Arguments
......@@ -51,9 +51,9 @@ python -m dynamo.frontend --help
The KV Router tracks two key metrics for each worker:
1. **Potential Active Blocks**: The total number of blocks that would be actively used for decoding if a request were routed to that worker. This includes existing active blocks plus new blocks from the incoming request.
1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The number of new tokens that would need to be prefilled (computed from scratch) on that worker, calculated as:
2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as:
- New prefill tokens = Total input tokens - (Overlap blocks × Block size)
- Potential prefill blocks = New prefill tokens / Block size
......@@ -61,12 +61,12 @@ The KV Router tracks two key metrics for each worker:
The router maintains block information through two complementary systems:
- **Active Decoding Blocks**: Tracked locally by the router based on the request lifecycle:
- Incremented when a new request is added
- Updated as new tokens are generated
- Decremented when a request completes
- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
- Incremented when adding a new request
- Updated during token generation
- Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer, which builds a prefix tree from KV events reported by workers. This provides accurate overlap information for routing decisions.
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
## Cost Function
......@@ -96,18 +96,18 @@ The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance betw
## KV Events vs. Approximation Mode
By default, the router uses KV events from workers to maintain an accurate global view of cached blocks. However, you can disable this with the `--no-kv-events` flag:
The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
- **With KV Events (default)**:
- Accurate overlap calculation based on actual cached blocks
- Higher accuracy but requires event processing overhead
- Best for production deployments
- Calculates overlap accurately using actual cached blocks
- Provides higher accuracy with event processing overhead
- Recommended for production deployments
- **Without KV Events (--no-kv-events)**:
- Uses the ApproxKvIndexer to approximate cached blocks based on routing decisions
- Assumes that recently routed requests will have their blocks cached
- Lower overhead but potentially less accurate routing
- Useful for testing or environments where event processing is a bottleneck
- Uses ApproxKvIndexer to estimate cached blocks from routing decisions
- Assumes blocks from recent requests remain cached
- Reduces overhead at the cost of routing accuracy
- Suitable for testing or when event processing becomes a bottleneck
## Tuning Guidelines
......@@ -138,9 +138,9 @@ The `router_temperature` parameter controls routing randomness:
### 4. Iterative Optimization
1. Start with default settings
1. Begin with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` based on your optimization goals:
- If TTFT is too high: Increase the weight
- If ITL is too high: Decrease the weight
4. Increase temperature if severe load imbalance occurs
\ No newline at end of file
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
- To reduce TTFT: Increase the weight
- To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment