<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# KV Router

## Overview

The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

## Quick Start

### Python / CLI Deployment

To launch the Dynamo frontend with the KV Router:

```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```

This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint

Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
- Tracks the state of all registered workers
- Makes routing decisions based on KV cache overlap
- Balances load across available workers

### Kubernetes Deployment

To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    Worker:
      # ... worker configuration ...
```

**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed

**Complete K8s Examples:**
- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)

**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.

## Configuration Options

### CLI Arguments (Python Deployment)

The KV Router supports several key configuration options:

- **`--router-mode kv`**: Enable KV cache-aware routing (required)

- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.

- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
  - `0.0`: Deterministic selection of the best worker
  - `> 0.0`: Probabilistic selection using softmax sampling
  - Higher values increase randomness, helping prevent worker saturation

- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)

- **`--kv-overlap-score-weight <float>`**: Balance between prefill and decode optimization (default: 1.0)
  - Higher values (> 1.0): Prioritize reducing prefill cost (better TTFT)
  - Lower values (< 1.0): Prioritize decode performance (better ITL)

For a complete list of available options:
```bash
python -m dynamo.frontend --help
```

### Kubernetes Environment Variables

All CLI arguments can be configured via environment variables in Kubernetes deployments. Use the `DYN_` prefix followed by the parameter name in uppercase, with hyphens replaced by underscores:

| CLI Argument | K8s Environment Variable | Default | Description |
|--------------|-------------------------|---------|-------------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` | Enable KV router |
| `--router-temperature <float>` | `DYN_ROUTER_TEMPERATURE=<float>` | `0.0` | Routing randomness |
| `--kv-cache-block-size <size>` | `DYN_KV_CACHE_BLOCK_SIZE=<size>` | Backend-specific | KV cache block size |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | Disable KV event tracking |
| `--kv-overlap-score-weight <float>` | `DYN_KV_OVERLAP_SCORE_WEIGHT=<float>` | `1.0` | Prefill vs decode weight |
| `--http-port <port>` | `DYN_HTTP_PORT=<port>` | `8000` | HTTP server port |
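
The naming convention in the table can be sketched as a small helper. `cli_flag_to_env_var` is a hypothetical function written for illustration and is not part of Dynamo; it covers value-taking flags only, since a negated flag such as `--no-kv-events` maps to `DYN_KV_EVENTS=false` instead.

```python
def cli_flag_to_env_var(flag: str) -> str:
    """Map a value-taking CLI flag to the DYN_-prefixed variable name.

    Hypothetical helper: it only mirrors the pattern shown in the table
    (strip leading dashes, hyphens to underscores, uppercase, DYN_ prefix).
    """
    name = flag.lstrip("-")
    return "DYN_" + name.replace("-", "_").upper()


print(cli_flag_to_env_var("--router-temperature"))      # DYN_ROUTER_TEMPERATURE
print(cli_flag_to_env_var("--kv-overlap-score-weight"))  # DYN_KV_OVERLAP_SCORE_WEIGHT
```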

### Example with Advanced Configuration

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
        - name: DYN_ROUTER_TEMPERATURE
          value: "0.5"  # Add some randomness to prevent worker saturation
        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
          value: "1.5"  # Prioritize TTFT over ITL
        - name: DYN_KV_CACHE_BLOCK_SIZE
          value: "16"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
```

### Alternative: Using Command Args in K8s

You can also pass CLI arguments directly in the container command:

```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    command:
      - /bin/sh
      - -c
    args:
      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
```

**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.

## KV Router Architecture

The KV Router tracks two key metrics for each worker:

1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.

2. **Potential New Prefill Blocks**: The number of blocks that must be computed from scratch on a worker, derived from the token counts as:
   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
   - Potential prefill blocks = New prefill tokens / Block size
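
As a worked example of the two formulas above, the following sketch computes the prefill cost for a single request. The function name and the sample numbers are illustrative assumptions; only the arithmetic comes from this document.

```python
def potential_prefill_blocks(total_input_tokens: int, overlap_blocks: int, block_size: int) -> float:
    """Blocks that a worker would still have to prefill for a request."""
    # Tokens already covered by cached blocks do not need to be recomputed.
    new_prefill_tokens = total_input_tokens - overlap_blocks * block_size
    # The result may be fractional when the token count is not a block multiple.
    return new_prefill_tokens / block_size


# 100 input tokens, 4 cached blocks of 16 tokens: 36 tokens remain to prefill.
print(potential_prefill_blocks(total_input_tokens=100, overlap_blocks=4, block_size=16))  # 2.25
```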

### Block Tracking Mechanisms

The router maintains block information through two complementary systems:

- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
  - Incremented when adding a new request
  - Updated during token generation
  - Decremented upon request completion

- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
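
To illustrate how a prefix tree can answer overlap queries, here is a minimal sketch. `PrefixIndex` and its methods are hypothetical names and do not reflect the actual KvIndexer implementation; the sketch handles only "block stored" events and ignores removals.

```python
from dataclasses import dataclass, field


@dataclass
class _Node:
    children: dict = field(default_factory=dict)  # block hash -> _Node
    workers: set = field(default_factory=set)     # workers caching this block


class PrefixIndex:
    """Toy prefix tree keyed by block hashes (illustrative only)."""

    def __init__(self):
        self.root = _Node()

    def store(self, worker, block_hashes):
        # Record that `worker` caches this sequence of blocks.
        node = self.root
        for block_hash in block_hashes:
            node = node.children.setdefault(block_hash, _Node())
            node.workers.add(worker)

    def overlap(self, block_hashes):
        # Walk the request's blocks and count matching leading blocks per worker.
        scores = {}
        node = self.root
        for block_hash in block_hashes:
            node = node.children.get(block_hash)
            if node is None:
                break
            for worker in node.workers:
                scores[worker] = scores.get(worker, 0) + 1
        return scores


index = PrefixIndex()
index.store("worker_1", ["a", "b", "c"])
index.store("worker_2", ["a", "x"])
print(sorted(index.overlap(["a", "b", "d"]).items()))  # [('worker_1', 2), ('worker_2', 1)]
```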

## Cost Function

The KV Router's routing decision is based on a simple cost function:

```
logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
```

Where:
- Lower logit values are better (less computational cost)
- The router uses softmax sampling with optional temperature to select workers
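
The selection step can be sketched as follows. This is an assumption-laden illustration rather than the router's code: `worker_logit` and `select_worker` are hypothetical names, and the softmax is taken over negated costs without any normalization the real router may apply.

```python
import math
import random


def worker_logit(prefill_blocks, active_blocks, overlap_weight=1.0):
    """Cost of routing a request to one worker (lower is better)."""
    return overlap_weight * prefill_blocks + active_blocks


def select_worker(logits, temperature=0.0, rng=random):
    """Pick a worker from a {worker: logit} mapping."""
    if temperature == 0.0:
        # Deterministic: always take the cheapest worker.
        return min(logits, key=logits.get)
    # Lower cost must be more likely, so the softmax runs over negated logits.
    scaled = {worker: -cost / temperature for worker, cost in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {worker: math.exp(value - top) for worker, value in scaled.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]


logits = {
    "worker_1": worker_logit(prefill_blocks=100.5, active_blocks=25.0),
    "worker_2": worker_logit(prefill_blocks=40.0, active_blocks=60.0),
}
print(select_worker(logits))  # worker_2
```

With a temperature above zero, the cheaper worker remains the most likely choice, but more expensive workers are picked occasionally, which spreads load.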

### Key Parameter: kv-overlap-score-weight

The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:

- **Higher values (> 1.0)**: Emphasize reducing prefill cost
  - Prioritizes routing to workers with better cache hits
  - Optimizes for Time To First Token (TTFT)
  - Best for workloads where initial response latency is critical

- **Lower values (< 1.0)**: Emphasize decode performance
  - Distributes active decoding blocks more evenly
  - Optimizes for Inter-Token Latency (ITL)
  - Best for workloads with long generation sequences
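
The effect of the weight can be seen with two hypothetical workers: one with a good cache hit but many active blocks, and one with a poor cache hit but little load. The worker names and block counts below are made-up illustration values.

```python
# name: (potential_prefill_blocks, potential_active_blocks)
workers = {
    "cache_hit_but_busy": (10.0, 60.0),
    "cache_miss_but_idle": (50.0, 10.0),
}


def best_worker(overlap_weight):
    """Return the worker with the lowest cost for a given weight."""
    costs = {
        name: overlap_weight * prefill + active
        for name, (prefill, active) in workers.items()
    }
    return min(costs, key=costs.get)


for weight in (0.5, 1.0, 2.0):
    print(weight, best_worker(weight))
# 0.5 cache_miss_but_idle
# 1.0 cache_miss_but_idle
# 2.0 cache_hit_but_busy
```

Raising the weight to 2.0 makes the 40-block prefill difference outweigh the 50-block load difference, so the request follows the cache.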

## KV Events vs. Approximation Mode

The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:

- **With KV Events (default)**:
  - Calculates overlap accurately using actual cached blocks
  - Provides higher accuracy at the cost of event processing overhead
  - Recommended for production deployments

- **Without KV Events (--no-kv-events)**:
  - Router predicts cache state based on routing decisions with TTL-based expiration and pruning
  - Tracks blocks from recent requests with configurable time-to-live
  - Reduces overhead at the cost of routing accuracy
  - **NATS is not needed** - suitable for simpler deployments without NATS infrastructure
  - Suitable for testing or when event processing becomes a bottleneck
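
The idea behind approximation mode can be sketched as a TTL-based tracker. `ApproxCacheTracker` is a hypothetical class written for illustration; the router's actual data structures, TTL settings, and pruning policy are not described in this document.

```python
import time


class ApproxCacheTracker:
    """Predict cached blocks from routing decisions alone (illustrative)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry = {}  # (worker, block_hash) -> expiry timestamp

    def record_routing(self, worker, block_hashes):
        # Assume the worker will cache every block of the request it was given.
        deadline = self.clock() + self.ttl
        for block_hash in block_hashes:
            self._expiry[(worker, block_hash)] = deadline

    def prune(self):
        # Drop entries whose time-to-live has elapsed.
        now = self.clock()
        self._expiry = {key: t for key, t in self._expiry.items() if t > now}

    def predicted_overlap(self, worker, block_hashes):
        # Count leading blocks the worker is still predicted to hold.
        self.prune()
        count = 0
        for block_hash in block_hashes:
            if (worker, block_hash) not in self._expiry:
                break
            count += 1
        return count


tracker = ApproxCacheTracker(ttl_seconds=120.0)
tracker.record_routing("worker_1", ["a", "b"])
print(tracker.predicted_overlap("worker_1", ["a", "b", "c"]))  # 2
```

Because the prediction never hears back from the workers, an evicted block keeps counting as cached until its TTL expires, which is the accuracy trade-off noted above.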

## Event Transport Modes

The router supports two event transport modes for KV cache state synchronization:

- **JetStream (default)**: Persistent event stream with durable consumers. State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency.

- **NATS Core with Local Indexer** (`--enable-local-indexer` on workers): Fire-and-forget pub/sub where workers maintain local radix trees. Router rebuilds state by querying workers on startup. Lower latency, simpler setup.

See [KV Cache Routing](kv_cache_routing.md#global-kv-cache-state-synchronization) for architecture diagrams and details.

## Disaggregated Serving

Dynamo supports disaggregated serving where prefill and decode are handled by separate worker pools. Register prefill workers with `ModelType.Prefill` and the frontend automatically activates an internal prefill router.

Key points:
- Prefill router auto-activates when both prefill and decode workers register with the same model name
- Supports vLLM and TensorRT-LLM backends (SGLang requires separate router setup)
- Use `--no-track-active-blocks` for prefill-only workers

See [KV Cache Routing - Disaggregated Serving](kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for setup examples.

## Router Replicas and State Persistence

For high availability, run multiple router replicas with `--router-replica-sync` to synchronize active block tracking via NATS.

State persistence options:
- **JetStream mode**: Automatic persistence via event stream and object store snapshots
- **Local Indexer mode**: State rebuilds from workers on startup
- **Reset state**: Use `--router-reset-states` to start fresh (use with caution)

See [KV Cache Routing - Serving Multiple Router Replicas](kv_cache_routing.md#serving-multiple-router-replicas) for details.

## Busy Thresholds

Control worker saturation with busy thresholds:
- `--active-decode-blocks-threshold <0.0-1.0>`: Mark workers busy when KV cache utilization exceeds threshold
- `--active-prefill-tokens-threshold <count>`: Mark workers busy when active prefill tokens exceed threshold

Thresholds can be updated at runtime via the `/busy_threshold` HTTP endpoint. See [Dynamic Threshold Configuration](kv_cache_routing.md#dynamic-threshold-configuration).

## Python API

For programmatic routing control, use the `KvPushRouter` class directly:

```python
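from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig

# `endpoint` (a Dynamo endpoint handle) and `tokens` (a list of token IDs)
# are assumed to be defined by the caller; run this inside an async function.
router = KvPushRouter(endpoint=endpoint, block_size=16, kv_router_config=KvRouterConfig())
stream = await router.generate(token_ids=tokens, model="model-name")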
```

Key methods: `generate()`, `best_worker()`, `get_potential_loads()`, `mark_prefill_complete()`, `free()`.

See [KV Cache Routing - Python API](kv_cache_routing.md#using-kvpushrouter-python-api) for complete examples.

## Prerequisites and Limitations

- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`
- **No multimodal support**: Currently tracks token-based blocks only
- **No static endpoints**: Use `--router-mode round-robin` for static endpoint deployments

See [KV Cache Routing - Prerequisites](kv_cache_routing.md#prerequisites-and-limitations) for details.

## Tuning Guidelines

### 1. Understand Your Workload Characteristics

- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`

### 2. Monitor Key Metrics

The router logs the cost calculation for each worker:
```
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```

This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
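
For offline analysis, log lines of this shape can be parsed with a regular expression. The sketch below assumes the format matches the example line exactly, which may not hold across Dynamo versions; `parse_cost_line` and the sample line are hypothetical.

```python
import math
import re

# Pattern inferred from the example log line; the real format is an assumption.
LOG_PATTERN = re.compile(
    r"Formula for (?P<worker>\S+): (?P<total>[\d.]+) = (?P<weight>[\d.]+) \* "
    r"(?P<prefill>[\d.]+) \+ (?P<active>[\d.]+) \(cached_blocks: (?P<cached>\d+)\)"
)


def parse_cost_line(line):
    """Return the cost components of one log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "worker": fields["worker"],
        "total": float(fields["total"]),
        "weight": float(fields["weight"]),
        "prefill_blocks": float(fields["prefill"]),
        "active_blocks": float(fields["active"]),
        "cached_blocks": int(fields["cached"]),
    }


# Made-up line in the same format as the example.
entry = parse_cost_line("Formula for worker_2: 61.0 = 1.5 * 24.0 + 25.0 (cached_blocks: 8)")
recomputed = entry["weight"] * entry["prefill_blocks"] + entry["active_blocks"]
print(entry["worker"], math.isclose(recomputed, entry["total"]))  # worker_2 True
```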

### 3. Temperature-Based Routing

The `--router-temperature` option controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution

### 4. Iterative Optimization

1. Begin with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
   - To reduce TTFT: Increase the weight
   - To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting