README.md 11 KB
Newer Older
1
<!-- # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2
# SPDX-License-Identifier: Apache-2.0 -->
Yan Ru Pei's avatar
Yan Ru Pei committed
3
4
5
6
7
8
9
10

# Router Benchmarking Guide

This directory contains scripts for benchmarking the Dynamo router with prefix caching. The benchmarks measure performance improvements from prefix sharing across requests.

## Prerequisites

- NVIDIA GPUs (8 GPUs for default configuration)
11
- (optional) H100 GPUs or later for gpt-oss-120b examples
Yan Ru Pei's avatar
Yan Ru Pei committed
12
13
14
15
- CUDA environment properly configured
- etcd and NATS running (required for Dynamo coordination)
- Required Python packages:
  - `dynamo` package (with vllm and frontend modules)
16
  - `aiperf` for benchmarking
Yan Ru Pei's avatar
Yan Ru Pei committed
17
  - `matplotlib` for plotting results
18
  - `data-generator` package (install with `pip install -e ./benchmarks` from repo root)
Yan Ru Pei's avatar
Yan Ru Pei committed
19

20
21
22
23
24
25
26
> [!Note]
> If running outside a container, set `DYNAMO_HOME` to the root path of your Dynamo repository:
> ```bash
> export DYNAMO_HOME=/path/to/dynamo
> ```
> When running in a container, this defaults to `/workspace`.

Yan Ru Pei's avatar
Yan Ru Pei committed
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
### Setting up etcd and NATS

This benchmark requires etcd and NATS. To quickly set them up, run:

```bash
# From the repository root:
docker compose -f deploy/docker-compose.yml up -d
```

This will start both etcd and NATS with the required configurations in the background.

## Scripts Overview

- **`run_engines.sh`** - Launches multiple vLLM worker instances
- **`ping.sh`** - Simple test script to verify the setup is working
- **`prefix_ratio_benchmark.py`** - Main benchmarking script that sweeps prefix ratios
43
- **`real_data_benchmark.py`** - Benchmarking script that uses real mooncake-style trace data
Yan Ru Pei's avatar
Yan Ru Pei committed
44
45
46
47
- **`plot_prefix_ratio_comparison.py`** - Generates comparison plots from benchmark results

## Usage Instructions

48
### Step 1: Launch Workers
Yan Ru Pei's avatar
Yan Ru Pei committed
49

50
51
52
53
54
55
Make sure you have 8 GPUs for these examples, unless you are using mockers (see below). First, start the worker engines in a terminal.

The script supports three modes:
- **`agg` (default)**: Aggregated/monolithic workers that handle both prefill and decode
- **`decode`**: Workers dedicated to decode (token generation) phase
- **`prefill`**: Workers dedicated to prefill (prompt processing) phase
Yan Ru Pei's avatar
Yan Ru Pei committed
56
57

```bash
58
# Default: 8 aggregated workers with DeepSeek model (handles both prefill and decode)
Yan Ru Pei's avatar
Yan Ru Pei committed
59
60
61
62
./run_engines.sh \
    --num-workers 8 \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B

63
# Example: 4 workers with larger model using tensor parallelism (2 GPUs per worker)
64
# NOTE: this requires having Hopper or later GPU SKUs to support MXFP4 precision.
Yan Ru Pei's avatar
Yan Ru Pei committed
65
66
67
68
69
70
./run_engines.sh \
    --num-workers 4 \
    --model-path openai/gpt-oss-120b \
    --tensor-parallel-size 2
```

71
#### Disaggregated Serving (Decode + Prefill Workers)
Yan Ru Pei's avatar
Yan Ru Pei committed
72

73
You can launch separate decode and prefill workers for disaggregated serving. This allows you to dedicate specific GPUs to prefill (prompt processing) and decode (token generation) tasks:
Yan Ru Pei's avatar
Yan Ru Pei committed
74
75
76
77

```bash
# Launch 4 decode workers (GPUs 0-3)
./run_engines.sh \
78
    --decode \
Yan Ru Pei's avatar
Yan Ru Pei committed
79
80
81
82
83
    --num-workers 4 \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B

# Launch 4 prefill workers (GPUs 4-7)
./run_engines.sh \
84
    --prefill \
Yan Ru Pei's avatar
Yan Ru Pei committed
85
86
87
88
89
    --num-workers 4 \
    --base-gpu-offset 4 \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```

Yan Ru Pei's avatar
Yan Ru Pei committed
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
#### Alternative: Launch vLLM Mock Workers

We also supports running lightweight mock engines that simulate vLLM behavior without performing actual model inference. Mocker engines are useful for testing router logic and performance without GPU requirements. Use the `--mockers` flag to run mocker engines instead of real vLLM workers.

```bash
# Example: Running mocker engines for testing (no GPU required)
./run_engines.sh --mockers \
    --num-workers 8 \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --block-size 64 \
    --speedup-ratio 2.0
```

**Note**: The `--speedup-ratio` parameter controls the inference speed of mocker engines. A higher value (e.g., 2.0) makes the mocker engines simulate faster inference, allowing benchmarks to complete more quickly. This is particularly useful for testing router performance without waiting for realistic inference times.

### Step 2: Start the Router

In a **new terminal**, launch the Dynamo router using the Python CLI:

```bash
110
111
112
# Explicitly set NATS server for KV event publishing
export NATS_SERVER="${NATS_SERVER:-nats://localhost:4222}"

Yan Ru Pei's avatar
Yan Ru Pei committed
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
python -m dynamo.frontend \
    --router-mode kv \
    --router-reset-states \
    --http-port 8000
```

This starts the router with:
- KV cache routing mode
- `--router-reset-states` flag to clear the event cache (JetStream) from previous runs (useful for single router benchmarking)
- HTTP port 8000

To see all available router arguments, run:
```bash
python -m dynamo.frontend --help
```

129
For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).
Yan Ru Pei's avatar
Yan Ru Pei committed
130

131
132
133
134
135
136
137
138
139
140
> [!Note]
> If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
>
> ```bash
> python -m dynamo.frontend \
>     --router-mode kv \
>     --http-port 8000 \
>     --no-kv-events
> ```

141
#### Disaggregated Serving with Automatic Prefill Routing
Yan Ru Pei's avatar
Yan Ru Pei committed
142

143
144
When you launch prefill workers using `run_engines.sh --prefill`, the frontend automatically detects them and activates an internal prefill router. This prefill router:
- Automatically routes initial token processing to dedicated prefill workers
145
- Uses the same routing mode as the frontend's `--router-mode` setting
146
- Seamlessly integrates with your decode workers for token generation
Yan Ru Pei's avatar
Yan Ru Pei committed
147

148
No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
Yan Ru Pei's avatar
Yan Ru Pei committed
149

150
151
> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)
Yan Ru Pei's avatar
Yan Ru Pei committed
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198

### Step 3: Verify Setup

In another terminal, test that everything is working:

```bash
./ping.sh
# Or specify a different port:
./ping.sh 8000
```

This sends a simple test request to the router. You should see a streamed response if everything is configured correctly.

### Step 4: Run Benchmarks

Once the setup is verified, run the prefix ratio benchmark:

```bash
python prefix_ratio_benchmark.py
```

Default configuration:
- Tests prefix ratios: 0.5 (can be customized with `--prefix-ratios 0.1 0.3 0.5 0.7 0.9`)
- Input sequence length: 14000 tokens
- Output sequence length: 200 tokens
- Requests: 200
- Concurrency: 20

You can customize the benchmark:

```bash
# Test multiple prefix ratios
python prefix_ratio_benchmark.py --prefix-ratios 0.1 0.3 0.5 0.7 0.9

# Adjust input/output lengths
python prefix_ratio_benchmark.py --isl 10000 --osl 500

# Change request count and concurrency
python prefix_ratio_benchmark.py --requests 500 --concurrency 50

# Use multiple router endpoints for parallel benchmarking (for testing multiple Router replicas)
python prefix_ratio_benchmark.py --url http://localhost:8000 http://localhost:8001

# Specify output directory
python prefix_ratio_benchmark.py --output-dir results/experiment1
```

199
### Step 4 (Alternative): Run Benchmarks with Real Trace Data
Yan Ru Pei's avatar
Yan Ru Pei committed
200

201
202
203
204
205
206
207
208
209
Instead of synthetic benchmarks with controlled prefix ratios, you can benchmark using real trace data. This approach uses actual request patterns from production traces, potentially modified with synthesis parameters.

First, download the mooncake trace dataset:

```bash
wget https://raw.githubusercontent.com/kvcache-ai/Mooncake/d21da178bae8db9651cf18a76824c084145fc725/mooncake_trace.jsonl
```

Then run the benchmark:
Yan Ru Pei's avatar
Yan Ru Pei committed
210

211
```bash
212
python real_data_benchmark.py --input-dataset mooncake_trace.jsonl
213
214
```

215
The script can apply various modifications on top of the original trace dataset to simulate different scenarios and workload conditions. This script accepts the same synthesis parameters as the [prefix data generator](../prefix_data_generator/README.md):
Yan Ru Pei's avatar
Yan Ru Pei committed
216

217
218
219
220
221
222
223
**Key parameters:**
- `--num-requests`: Number of requests to synthesize from the trace (default: use all)
- `--speedup-ratio`: Speed up request arrival times (e.g., 2.0 makes requests arrive 2x faster)
- `--prefix-len-multiplier`: Scale the length of shared prefixes (e.g., 2.0 doubles prefix lengths)
- `--prefix-root-multiplier`: Replicate the prefix tree structure N times with different roots
- `--prompt-len-multiplier`: Scale the length of unique user prompts (e.g., 0.5 for shorter prompts)
- `--max-isl`: Filter out requests exceeding this input sequence length
Yan Ru Pei's avatar
Yan Ru Pei committed
224

225
226
227
Examples:

```bash
228
229
# Use original trace dataset as-is (no synthesis parameters specified)
python real_data_benchmark.py --input-dataset trace.jsonl
230
231

# Speed up request rate by 2x and use only first 1000 requests
232
python real_data_benchmark.py --input-dataset trace.jsonl --num-requests 1000 --speedup-ratio 2.0
233
234

# Double prefix lengths to test cache efficiency with longer shared contexts
235
python real_data_benchmark.py --input-dataset trace.jsonl --prefix-len-multiplier 2.0
236
237

# Create more diverse workload by replicating prefix tree 3 times
238
python real_data_benchmark.py --input-dataset trace.jsonl --prefix-root-multiplier 3
239
```
Yan Ru Pei's avatar
Yan Ru Pei committed
240

241
> [!Note]
242
> At the time of writing this documentation, you may need to install the latest aiperf from the main source branch to loadgen on the trace files:
243
> ```bash
244
> pip install git+https://github.com/ai-dynamo/aiperf.git
245
> ```
246
> However, by the time of release, the aiperf version included in the vLLM runtime container should be up to date enough to use as-is.
247

248
249
250
251
252
253
254
255
256
257
258
259
260
## Benchmarking Results

We benchmarked the Dynamo KV Router against a baseline round-robin routing strategy to evaluate the performance benefits of cache-aware routing. The experiments were conducted using deepseek-ai/DeepSeek-R1-Distill-Llama-8B on 8 L40S GPUs under aggregated serving, with the following configuration:

- **ISL/OSL**: 14000/200
- **Prefix Ratios**: 0.1, 0.3, 0.5, 0.7, 0.9
- **Workload**: 200 requests organized into 20 prefix groups
- **Concurrency**: 20 concurrent requests

![Router Performance Comparison](results.png)

The results demonstrate that the Dynamo KV Router consistently outperforms round-robin routing across all prefix ratio settings, with performance gains increasing as the prefix ratio grows. This highlights the importance of cache-aware routing for workloads with significant prefix sharing such as multi-turn conversations, document Q&A, and prompt engineering iterations.

Yan Ru Pei's avatar
Yan Ru Pei committed
261
262
263
264
265
266
## Troubleshooting

1. **Workers fail to start**: Check CUDA_VISIBLE_DEVICES and GPU availability
2. **Router connection refused**: Ensure router is running and port is correct
3. **Benchmark timeout**: Decrease concurrency or reduce request count
4. **OOM errors**: Reduce max-num-batched-tokens or max-model-len in run_engines.sh