Unverified Commit f7e468c7 authored by Yan Ru Pei's avatar Yan Ru Pei Committed by GitHub
Browse files

docs: fix lowest cost mistake in doc (#2361)


Signed-off-by: default avatarYan Ru Pei <yanrpei@gmail.com>
parent 5166a3dd
...@@ -4,26 +4,26 @@ SPDX-License-Identifier: Apache-2.0 ...@@ -4,26 +4,26 @@ SPDX-License-Identifier: Apache-2.0
--> -->
# KV Cache Routing # KV Cache Routing
This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers. This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
To enable KV cache aware routing start the frontend node like this: To enable KV cache aware routing start the frontend node like this:
``` ```
python -m dynamo.frontend --router-mode kv python -m dynamo.frontend --router-mode kv
``` ```
The engine announces when a KV block is created or removed. The Dynamo router run finds the worker with the best match for those KV blocks and directs the traffic to that node. When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing. To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
The KV-aware routing arguments: The KV-aware routing arguments:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks). Defaults to 1. - `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 (default) recovers the deterministic behavior where the min logit is picked. - `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--use-kv-events`/`--no-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true (default), then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events. - `--use-kv-events`/`--no-kv-events`: Determines how the router tracks cached blocks. When enabled (default), uses `KvIndexer` to monitor block creation and deletion events. When disabled, uses `ApproxKvIndexer`, which estimates cache hits based on a fixed time window (120s). Disable this if your backend doesn't support KV events.
- `--router-replica-sync`: Enables state synchronization between multiple router replicas via NATS. Disabled by default, and can be enabled by passing the flag in. When enabled, router replicas share their view of KV cache distribution and active sequences, allowing all routers to make optimal routing decisions even when requests are distributed across multiple router instances. This improves fault tolerance and routing accuracy in multi-router deployments. - `--router-replica-sync`: Enables NATS-based state synchronization between router replicas. When enabled, routers share their KV cache distribution and active sequence information, ensuring optimal routing decisions across multiple router instances. This improves fault tolerance and routing accuracy in distributed deployments. Disabled by default.
## Architecture ## Architecture
...@@ -107,29 +107,29 @@ Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/in ...@@ -107,29 +107,29 @@ Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/in
| | | | | |
| Cached: 2 blocks | Cached: 5 blocks | Cached: 8 blocks | Cached: 2 blocks | Cached: 5 blocks | Cached: 8 blocks
| Prefill: 8 blks | Prefill: 5 blks | Prefill: 2 blks | Prefill: 8 blks | Prefill: 5 blks | Prefill: 2 blks
| Decode: 10 blks | Decode: 7 blks | Decode: 9 blks | Decode: 10 blks | Decode: 5 blks | Decode: 9 blks
v v v v v v
+----------------+ +----------------+ +----------------+ +----------------+ +----------------+ +----------------+
| Worker 1 | | Worker 2 | | Worker 3 | | Worker 1 | | Worker 2 | | Worker 3 |
+----------------+ +----------------+ +----------------+ +----------------+ +----------------+ +----------------+
``` ```
Load balancing in LLM serving becomes complex when enabling KV Cache reuse. While KV Cache reuse can save significant computation, if the routing strategy is not aware of the unique KV states of each worker we can: KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- miss opportunities for KV Cache reuse if routing to the "wrong" node - Missed cache reuse opportunities due to suboptimal worker selection
- get into an imbalanced state where a few workers are processing many requests, lowering throughput of entire system - System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions: The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
### Cost Calculation ### Cost Calculation
1. **Prefill blocks**: The number of tokens that need to be processed during prefill is predicted based on the request's input tokens and the cached blocks available on each worker. This is divided by the block size to get the effective "prefill blocks". This prediction is updated when the first output token is produced, signaling prefill completion. 1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: The number of blocks needed during the decode phase is predicted based on the request's input tokens and the current active sequences on each worker. This is updated when the request is freed (blocks are dereferenced or freed). 2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks` 3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower cost is better - Lower costs indicate better routing choices
- The `overlap_score_weight` parameter controls the importance of cache hits vs. load balancing - `overlap_score_weight` balances cache hit optimization against load distribution
- A higher weight prioritizes cache reuse (better TTFT) while a lower weight prioritizes load distribution (better ITL) - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
### Worker Selection ### Worker Selection
...@@ -137,14 +137,12 @@ The router selects the worker with the lowest cost. When `router_temperature` is ...@@ -137,14 +137,12 @@ The router selects the worker with the lowest cost. When `router_temperature` is
Example calculation with `overlap_score_weight = 1.0`: Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18 - Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 7 = 12** (selected - lowest cost) - **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11 - Worker 3: cost = 1.0 * 2 + 9 = 11
## Events ## Events
In Dynamo, we support KV Cache Routing for many backends that have different implementations of KV Cache. To enable this, we built a KVPublisher that can be plugged into any framework to publish KV Events. Dynamo supports KV Cache Routing across multiple backend implementations through a flexible event system. The KVPublisher component integrates with any framework to emit KV events, while the KVIndexer component maintains a global prefix tree of cached blocks by processing these events from all workers.
On the receiving side we have a KVIndexer which accepts events from the KVPublisher and puts them into a global prefix tree for tracking cached blocks across all workers.
```text ```text
+----------------+ +-----------------+ +----------------+ +-----------------+
...@@ -174,13 +172,13 @@ The KVIndexer has a method `find_matches_for_request`, which takes in tokens and ...@@ -174,13 +172,13 @@ The KVIndexer has a method `find_matches_for_request`, which takes in tokens and
### Inter-Router Communication ### Inter-Router Communication
In multi-router deployments, each router only observes a subset of requests. To maintain a consistent global view of active sequences and KV cache states, routers broadcast their local actions to other replicas through three synchronization events: In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
1. **AddRequest**: Published when assigning a request to a worker, containing the request ID, worker ID, token sequence blocks, and overlap score. This updates other routers' tracking of which blocks are in use. 1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
2. **MarkPrefillCompleted**: Published when a request transitions from prefill to decode phase, signaling that prefill tokens should no longer count toward the worker's active prefill load. 2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
3. **Free**: Published when a request completes and its resources are released, allowing other routers to update their block reference counts. 3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
Each event includes a unique router ID to prevent processing of self-generated events. This asynchronous communication ensures all routers maintain synchronized KV cache state for optimal routing decisions despite handling different request streams. Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
...@@ -7,7 +7,7 @@ SPDX-License-Identifier: Apache-2.0 ...@@ -7,7 +7,7 @@ SPDX-License-Identifier: Apache-2.0
## Overview ## Overview
Dynamo's KV Router makes intelligent routing decisions by evaluating the computational cost of processing requests on different workers. The router considers both the decoding cost (active blocks) and prefill cost (new blocks that need to be computed). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in a distributed inference setup. The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start ## Quick Start
...@@ -22,10 +22,10 @@ This command: ...@@ -22,10 +22,10 @@ This command:
- Exposes the service on port 8080 (configurable) - Exposes the service on port 8080 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint - Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers can register themselves using the `register_llm` API, and the KV Router will automatically include them in its routing decisions. The router will: Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
- Track the state of all registered workers - Tracks the state of all registered workers
- Make intelligent routing decisions based on KV cache overlap - Makes routing decisions based on KV cache overlap
- Balance load across available workers - Balances load across available workers
### Important Arguments ### Important Arguments
...@@ -51,9 +51,9 @@ python -m dynamo.frontend --help ...@@ -51,9 +51,9 @@ python -m dynamo.frontend --help
The KV Router tracks two key metrics for each worker: The KV Router tracks two key metrics for each worker:
1. **Potential Active Blocks**: The total number of blocks that would be actively used for decoding if a request were routed to that worker. This includes existing active blocks plus new blocks from the incoming request. 1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The number of new tokens that would need to be prefilled (computed from scratch) on that worker, calculated as: 2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as:
- New prefill tokens = Total input tokens - (Overlap blocks × Block size) - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
- Potential prefill blocks = New prefill tokens / Block size - Potential prefill blocks = New prefill tokens / Block size
...@@ -61,12 +61,12 @@ The KV Router tracks two key metrics for each worker: ...@@ -61,12 +61,12 @@ The KV Router tracks two key metrics for each worker:
The router maintains block information through two complementary systems: The router maintains block information through two complementary systems:
- **Active Decoding Blocks**: Tracked locally by the router based on the request lifecycle: - **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
- Incremented when a new request is added - Incremented when adding a new request
- Updated as new tokens are generated - Updated during token generation
- Decremented when a request completes - Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer, which builds a prefix tree from KV events reported by workers. This provides accurate overlap information for routing decisions. - **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
## Cost Function ## Cost Function
...@@ -96,18 +96,18 @@ The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance betw ...@@ -96,18 +96,18 @@ The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance betw
## KV Events vs. Approximation Mode ## KV Events vs. Approximation Mode
By default, the router uses KV events from workers to maintain an accurate global view of cached blocks. However, you can disable this with the `--no-kv-events` flag: The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
- **With KV Events (default)**: - **With KV Events (default)**:
- Accurate overlap calculation based on actual cached blocks - Calculates overlap accurately using actual cached blocks
- Higher accuracy but requires event processing overhead - Provides higher accuracy with event processing overhead
- Best for production deployments - Recommended for production deployments
- **Without KV Events (--no-kv-events)**: - **Without KV Events (--no-kv-events)**:
- Uses the ApproxKvIndexer to approximate cached blocks based on routing decisions - Uses ApproxKvIndexer to estimate cached blocks from routing decisions
- Assumes that recently routed requests will have their blocks cached - Assumes blocks from recent requests remain cached
- Lower overhead but potentially less accurate routing - Reduces overhead at the cost of routing accuracy
- Useful for testing or environments where event processing is a bottleneck - Suitable for testing or when event processing becomes a bottleneck
## Tuning Guidelines ## Tuning Guidelines
...@@ -138,9 +138,9 @@ The `router_temperature` parameter controls routing randomness: ...@@ -138,9 +138,9 @@ The `router_temperature` parameter controls routing randomness:
### 4. Iterative Optimization ### 4. Iterative Optimization
1. Start with default settings 1. Begin with default settings
2. Monitor TTFT and ITL metrics 2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` based on your optimization goals: 3. Adjust `kv-overlap-score-weight` to meet your performance goals:
- If TTFT is too high: Increase the weight - To reduce TTFT: Increase the weight
- If ITL is too high: Decrease the weight - To reduce ITL: Decrease the weight
4. Increase temperature if severe load imbalance occurs 4. If you observe severe load imbalance, increase the temperature setting
\ No newline at end of file \ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment