README.md 5.81 KB
Newer Older
Yan Ru Pei's avatar
Yan Ru Pei committed
1
2
3
4
5
6
7
8
9
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# KV Router

## Overview

10
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
Yan Ru Pei's avatar
Yan Ru Pei committed
11
12
13
14
15
16
17
18
19
20
21
22
23
24

## Quick Start

To launch the Dynamo frontend with the KV Router:

```bash
python -m dynamo.frontend --router-mode kv --http-port 8080
```

This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8080 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint

25
26
27
28
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
- Tracks the state of all registered workers
- Makes routing decisions based on KV cache overlap
- Balances load across available workers
Yan Ru Pei's avatar
Yan Ru Pei committed
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

### Important Arguments

The KV Router supports several key configuration options:

- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.

- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
  - `0.0`: Deterministic selection of the best worker
  - `> 0.0`: Probabilistic selection using softmax sampling
  - Higher values increase randomness, helping prevent worker saturation

- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)

For a complete list of available options:
```bash
python -m dynamo.frontend --help
```

## KV Router Architecture

The KV Router tracks two key metrics for each worker:

54
1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
Yan Ru Pei's avatar
Yan Ru Pei committed
55

56
2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as:
Yan Ru Pei's avatar
Yan Ru Pei committed
57
58
59
60
61
62
63
   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
   - Potential prefill blocks = New prefill tokens / Block size

### Block Tracking Mechanisms

The router maintains block information through two complementary systems:

64
65
66
67
- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
  - Incremented when adding a new request
  - Updated during token generation
  - Decremented upon request completion
Yan Ru Pei's avatar
Yan Ru Pei committed
68

69
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
Yan Ru Pei's avatar
Yan Ru Pei committed
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98

## Cost Function

The KV Router's routing decision is based on a simple cost function:

```
logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
```

Where:
- Lower logit values are better (less computational cost)
- The router uses softmax sampling with optional temperature to select workers

### Key Parameter: kv-overlap-score-weight

The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:

- **Higher values (> 1.0)**: Emphasize reducing prefill cost
  - Prioritizes routing to workers with better cache hits
  - Optimizes for Time To First Token (TTFT)
  - Best for workloads where initial response latency is critical

- **Lower values (< 1.0)**: Emphasize decode performance
  - Distributes active decoding blocks more evenly
  - Optimizes for Inter-Token Latency (ITL)
  - Best for workloads with long generation sequences

## KV Events vs. Approximation Mode

99
The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
Yan Ru Pei's avatar
Yan Ru Pei committed
100
101

- **With KV Events (default)**:
102
103
104
  - Calculates overlap accurately using actual cached blocks
  - Provides higher accuracy with event processing overhead
  - Recommended for production deployments
Yan Ru Pei's avatar
Yan Ru Pei committed
105
106

- **Without KV Events (--no-kv-events)**:
107
108
109
110
  - Uses ApproxKvIndexer to estimate cached blocks from routing decisions
  - Assumes blocks from recent requests remain cached
  - Reduces overhead at the cost of routing accuracy
  - Suitable for testing or when event processing becomes a bottleneck
Yan Ru Pei's avatar
Yan Ru Pei committed
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140

## Tuning Guidelines

### 1. Understand Your Workload Characteristics

- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`

### 2. Monitor Key Metrics

The router logs the cost calculation for each worker:
```
Formula for worker_1: 125.3 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```

This shows:
- Total cost (125.3)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)

### 3. Temperature-Based Routing

The `router_temperature` parameter controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution

### 4. Iterative Optimization

141
1. Begin with default settings
Yan Ru Pei's avatar
Yan Ru Pei committed
142
2. Monitor TTFT and ITL metrics
143
144
145
146
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
   - To reduce TTFT: Increase the weight
   - To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting