---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: KV Router A/B Testing
---

This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.

## Overview

Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:

1. Deploy two otherwise identical Dynamo configurations:
   a. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITHOUT** KV Smart Router enabled
   b. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITH** KV Smart Router enabled
2. Run controlled benchmarks using AIPerf
3. Compare performance metrics to evaluate KV router effectiveness

**Prerequisites:** Kubernetes cluster with GPUs, kubectl, helm

---

## Prerequisites

### Required Tools

- `kubectl` (configured with cluster access)
- `helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
  - GPU nodes (H100, H200, or similar)
  - Sufficient GPU capacity (8+ GPUs recommended for this example)
  - Dynamo platform installed globally OR ability to install per-namespace

### Knowledge Requirements

- Basic Kubernetes concepts (namespaces, pods, services)
- Familiarity with LLM inference concepts
- Command-line proficiency

---

## Architecture

This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.

```text
┌──────────────────────────────────────────────┐
│ Namespace: dynamo-bench                       │
│ (one of A or B active at a time)              │
│                                              │
│  Deployment A: Router OFF                     │
│    ├─ Frontend (Standard Routing)              │
│    └─ 8x Decode Workers (1 GPU each)          │
│                                              │
│  Deployment B: Router ON                      │
│    ├─ Frontend (KV Smart Router)               │
│    └─ 8x Decode Workers (1 GPU each)          │
│                                              │
│  Benchmark Pod (AIPerf + Dataset)             │
└──────────────────────────────────────────────┘
```

**Key Difference:** Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.

---

## Phase 1: Namespace and Infrastructure Setup

### Step 1.1: Create Namespace

```bash
kubectl create namespace dynamo-bench
```

### Step 1.2: Create HuggingFace Token Secret (optional)

If the model you want to deploy is gated and requires a HuggingFace token to download (Llama-family models do, for example), create the secret below, replacing `YOUR_HF_TOKEN` with your actual token:

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
  -n dynamo-bench
```

### Step 1.3: Install Dynamo Platform

If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in the workload namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in `dynamo-bench`.

**Key Configuration Notes:**
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
- Adjust version tags to match your cluster's available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation

### Step 1.4: Verify Infrastructure

```bash
kubectl get pods -n dynamo-bench
```

Expect the operator, etcd, and nats pods to be `Running` before you deploy the graph.

---

## Phase 2: Deploy Model Serving

### Step 2.1: Create Deployment YAMLs

Create `router-off-deployment.yaml` (baseline):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-no-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-no-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
          env:
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-no-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h100-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
          workingDir: /workspace
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model Qwen/Qwen3-32B
              --quantization fp8
              --kv-cache-dtype fp8
              --max-model-len 131072
              --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
              --gpu-memory-utilization 0.90
              --block-size 64
              --async-scheduling
              --disable-log-requests
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
      subComponentType: decode
```

Create `router-on-deployment.yaml` (KV router ON):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
          env:
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # KEY DIFFERENCE: Enable KV Smart Router
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h100-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
          workingDir: /workspace
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model Qwen/Qwen3-32B
              --quantization fp8
              --kv-cache-dtype fp8
              --max-model-len 131072
              --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
              --gpu-memory-utilization 0.90
              --block-size 64
              --async-scheduling
              --disable-log-requests
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
      subComponentType: decode
```
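
The comparison is only meaningful if the two manifests differ in nothing but the deployment names and the router setting, so it can help to diff them before deploying. The sketch below is an illustrative check, not part of Dynamo: the helper name `manifest_diff` and the short inline strings are assumptions that stand in for the contents of `router-off-deployment.yaml` and `router-on-deployment.yaml`.

```python
import difflib

def manifest_diff(off_text: str, on_text: str) -> list[str]:
    """Return only the removed (-) and added (+) lines between two manifests."""
    diff = list(difflib.unified_diff(
        off_text.splitlines(), on_text.splitlines(),
        fromfile="router-off", tofile="router-on", lineterm="", n=0,
    ))
    # The first two entries are the ---/+++ file headers; hunk markers start with @@.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]

# Tiny hypothetical excerpts; in practice read the two YAML files from disk.
off = "metadata:\n  name: vllm-agg-no-router\nspec:\n  replicas: 8\n"
on = ("metadata:\n  name: vllm-agg-router\nspec:\n  replicas: 8\n"
      "  envs:\n    - name: DYN_ROUTER_MODE\n      value: kv\n")

for line in manifest_diff(off, on):
    print(line)
```

Any line in the output that is not a name change or the `DYN_ROUTER_MODE` entry points to an unintended difference between the two configurations.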

### Step 2.2: Deploy Router-ON First

```bash
kubectl apply -f router-on-deployment.yaml -n dynamo-bench
```

**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.

First, create the PVC in the same namespace as your deployment (e.g. `dynamo-bench`). Use a storage class that supports ReadWriteMany:

```bash
kubectl get storageclass   # choose one with ReadWriteMany (e.g. azurefile-csi-premium, nfs, efs)
```

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: dynamo-bench
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "azurefile-csi-premium"   # Adjust to your cluster
  resources:
    requests:
      storage: 100Gi
```

Save the manifest above as `pvc-model-cache.yaml` and apply it: `kubectl apply -f pvc-model-cache.yaml`

Then reference the existing PVC in your DynamoGraphDeployment by adding the following under `spec` (and under `VllmDecodeWorker`, add `volumeMounts`):

```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
      size: "0"
  services:
    VllmDecodeWorker:
      volumeMounts:
        - mountPoint: /root/.cache/huggingface
          name: model-cache
          useAsCompilationCache: false
```

With this configuration, the first run has one worker download; the rest load from cache. The main benefit is on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.

### Step 2.3: Monitor Deployment Progress

```bash
kubectl get pods -n dynamo-bench -w
```

Wait for all pods to reach `Running` status and pass readiness probes.

**Expected Timeline:**
- **With shared PVC** (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
- **Without shared PVC**: 20-30 minutes per worker (workers download independently)
  - For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)

The deployment's startup probe (`initialDelaySeconds: 120`, `periodSeconds: 30`, `failureThreshold: 60`) allows up to 32 minutes per pod for model download and initialization.
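
The 32-minute figure follows from the probe fields: the initial delay plus one probe period per allowed failure. A minimal sketch of that arithmetic is below; the helper name `startup_budget_seconds` is hypothetical, and the formula is the approximate upper bound used in this guide (it ignores the per-probe `timeoutSeconds`).

```python
def startup_budget_seconds(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate time a container gets before the startup probe gives up."""
    return initial_delay_s + period_s * failure_threshold

# Values from the deployment YAMLs in this guide.
budget = startup_budget_seconds(initial_delay_s=120, period_s=30, failure_threshold=60)
print(f"{budget} s = {budget / 60:.0f} min")  # 1920 s = 32 min
```

The same function shows how much headroom a different `failureThreshold` would give, for example when serving a larger model.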

### Step 2.4: Verify Workers Are Healthy

> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health. Unequal worker counts will invalidate your comparison results.

```bash
# Quick health check - should show "8/8"
echo "Workers: $(kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"

# Detailed view
kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker
```

**All 8 must show `1/1 Running` and Ready.** Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).
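
If `jq` is not available, the same readiness count can be computed from the JSON that `kubectl get pods ... -o json` returns. The sketch below is one possible way to do it, assuming the standard Kubernetes pod list shape (`items[].status.phase` and `items[].status.conditions[]`); the function name `count_ready_workers` and the hand-written sample are illustrative only.

```python
def count_ready_workers(pod_list: dict) -> int:
    """Count pods that are Running and report a Ready=True condition."""
    ready = 0
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Running":
            continue
        if any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        ):
            ready += 1
    return ready

# Hand-written stand-in for `kubectl get pods ... -o json` output.
sample = {"items": [
    {"status": {"phase": "Running", "conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"phase": "Running", "conditions": [{"type": "Ready", "status": "False"}]}},
    {"status": {"phase": "Pending", "conditions": []}},
]}
print(f"Workers: {count_ready_workers(sample)}/{len(sample['items'])} ready")
```

To use it against the cluster, load the output of the `kubectl` command with `json.load` and pass the resulting dictionary to the function.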

---

## Phase 3: Prepare Benchmark Dataset

### Understanding the Mooncake Toolagent Trace

For this A/B comparison, we use the [**Mooncake FAST'25 Toolagent Trace**](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/toolagent_trace.jsonl), published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake) (USENIX FAST'25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production **tool-agent workloads** — AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains **23,608 requests** spanning ~59 minutes of real-time traffic.

**Why the toolagent trace?** Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router's real-world performance gains.

**What's in the dataset?** Each trace entry contains:
- **Timestamp:** When the request arrived (for realistic request timing)
- **Input/output lengths:** Number of tokens in prompts and responses
- **Block hash IDs:** Cryptographic hashes representing KV cache blocks (no user text; explained below)

**Sample trace entries (showing prefix reuse):**
```json
{"timestamp": 0, "input_length": 9013, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}
{"timestamp": 0, "input_length": 6506, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 64]}
```

These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits.
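
The shared-prefix arithmetic can be reproduced directly from the two sample entries. The sketch below counts how many leading `hash_ids` two requests have in common and converts that to tokens using the 512-token block size stated above; the helper name `shared_prefix_blocks` is an assumption for illustration and is not how the router itself is implemented.

```python
BLOCK_TOKENS = 512  # tokens per hash ID, as described for this trace

def shared_prefix_blocks(a: list[int], b: list[int]) -> int:
    """Number of leading hash IDs that two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

# hash_ids from the two sample trace entries above.
first = list(range(46, 64))          # 46 .. 63
second = list(range(46, 58)) + [64]  # 46 .. 57, then 64

blocks = shared_prefix_blocks(first, second)
print(blocks, "blocks,", blocks * BLOCK_TOKENS, "tokens")  # 12 blocks, 6144 tokens
```

Only the leading run of matching IDs counts, because each hash covers the current block and everything before it.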

**Key Dataset Properties:**
- **Realistic timing:** Request arrival patterns from production tool-agent workloads
- **High prefix overlap:** 59% cache ratio ([Mooncake FAST'25 paper](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/Mooncake-FAST25.pdf)); iterative tool calls within sessions produce natural prefix reuse
- **Privacy-preserving:** No actual text, only hash-based cache block identifiers
- **Reproducible:** Public dataset enables fair comparisons across different systems

### Download and Prepare the Dataset

```bash
# Download the Mooncake FAST'25 toolagent trace
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl -o toolagent_trace.jsonl

# Slow down timestamps to 0.80× replay speed (~5.3 req/s instead of ~6.7 req/s)
python3 - <<'PY'
import json

with open("toolagent_trace.jsonl") as src, open("toolagent_trace_080x.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        rec["timestamp"] = int(rec["timestamp"] / 0.80)
        dst.write(json.dumps(rec) + "\n")
PY

echo "Dataset ready: toolagent_trace_080x.jsonl (23,608 requests, 0.80x speed)"
```
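
The request rates quoted in the script comment follow from the trace size and duration given earlier (23,608 requests over roughly 59 minutes). The sketch below redoes that arithmetic; the helper name `replay_rate` is hypothetical, and because the 59-minute duration is itself approximate, the results are estimates.

```python
def replay_rate(num_requests: int, duration_s: float, speed: float) -> float:
    """Mean requests per second when a trace is replayed at `speed` x real time."""
    replay_duration_s = duration_s / speed  # dividing timestamps by 0.80 stretches the trace
    return num_requests / replay_duration_s

REQUESTS = 23_608
DURATION_S = 59 * 60  # ~59 minutes, as stated for the toolagent trace

print(f"original: {replay_rate(REQUESTS, DURATION_S, 1.00):.1f} req/s")  # 6.7
print(f"0.80x:    {replay_rate(REQUESTS, DURATION_S, 0.80):.1f} req/s")  # 5.3
```

At 0.80× speed the replay also takes longer, roughly 59 / 0.80 ≈ 74 minutes per benchmark run.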

---

## Phase 4: Set Up Benchmark Environment

### Step 4.1: Deploy Benchmark Pod

Create `benchmark-job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: benchmark
        image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
        securityContext:
          runAsUser: 0  # Required: apt-get and pip install need root in ephemeral benchmark pod
        command:
          - /bin/bash
          - -lc
          - |
            apt-get update -qq && apt-get install -y -qq tmux > /dev/null 2>&1
            pip install -q aiperf==0.5.0
            echo "Benchmark pod ready (tmux + aiperf installed)."
            sleep infinity
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 0
```

This pod installs `tmux` and `aiperf` on startup so benchmarks can run inside a tmux session that survives `kubectl exec` disconnects.

Deploy:

```bash
kubectl apply -f benchmark-job.yaml -n dynamo-bench
```

Wait for the pod to be ready (startup takes ~1-2 minutes while it installs packages):

```bash
kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -w
```

### Step 4.2: Copy Dataset to Benchmark Pod

```bash
POD_NAME=$(kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
kubectl -n dynamo-bench cp toolagent_trace_080x.jsonl ${POD_NAME}:/tmp/toolagent_trace_080x.jsonl
```

---

## Phase 5: Run Benchmarks

### Step 5.1: Benchmark Router-ON

Verify the frontend service is reachable (the operator creates a service named `{deployment-name}-frontend`):

```bash
kubectl get svc -n dynamo-bench | grep frontend
```

Launch the benchmark inside a tmux session so it survives `kubectl exec` disconnects:

```bash
kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
  tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
    AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
      -m Qwen/Qwen3-32B \
      --tokenizer Qwen/Qwen3-32B \
      --input-file /tmp/toolagent_trace_080x.jsonl \
      --custom-dataset-type mooncake_trace \
      --fixed-schedule \
      --url http://vllm-agg-router-frontend.dynamo-bench.svc.cluster.local:8000 \
      --streaming \
      --random-seed 42 \
      --workers-max 200 \
      --request-timeout-seconds 1000 \
      --profile-export-level records \
      --record-processors 8 \
      --artifact-dir /tmp/aiperf_router_on \
      --goodput \"time_to_first_token:5000 inter_token_latency:100\""
'
```

AIPerf writes the run to `/tmp/aiperf_router_on` on the pod (summary JSON and `profile_export.jsonl`).

### Monitoring Benchmarks

Benchmarks run inside a **tmux session** so they survive `kubectl exec` disconnects.

Attach to the live TUI (detach with **Ctrl+B then D**):

```bash
kubectl -n dynamo-bench exec -it ${POD_NAME} -- tmux a -t benchmark
```

### Step 5.2: Switch to Router-OFF and Benchmark

Tear down router-ON and deploy the baseline:

```bash
kubectl delete dynamographdeployment vllm-agg-router -n dynamo-bench
kubectl apply -f router-off-deployment.yaml -n dynamo-bench
```

Wait for 8/8 workers to be Ready again (re-run the health check from [Step 2.4](#step-24-verify-workers-are-healthy)), then clean up the previous tmux session and launch the baseline benchmark:

```bash
kubectl -n dynamo-bench exec ${POD_NAME} -- tmux kill-session -t benchmark 2>/dev/null

kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
  tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
    AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
      -m Qwen/Qwen3-32B \
      --tokenizer Qwen/Qwen3-32B \
      --input-file /tmp/toolagent_trace_080x.jsonl \
      --custom-dataset-type mooncake_trace \
      --fixed-schedule \
      --url http://vllm-agg-no-router-frontend.dynamo-bench.svc.cluster.local:8000 \
      --streaming \
      --random-seed 42 \
      --workers-max 200 \
      --request-timeout-seconds 1000 \
      --profile-export-level records \
      --record-processors 8 \
      --artifact-dir /tmp/aiperf_router_off \
      --goodput \"time_to_first_token:5000 inter_token_latency:100\""
'
```

### Step 5.3: Collect Results

Copy the artifact directories (or the summary/export files inside them) to your machine:

```bash
kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_on ./aiperf_router_on
kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_off ./aiperf_router_off
```

Each artifact directory contains:
- `profile_export_aiperf.json` — summary with aggregated metrics (TTFT, latency percentiles, throughput)
- `profile_export.jsonl` — per-request records (one JSON object per completed request)

### Step 5.4: Quick Comparison

Extract and compare key metrics from the two summary files:

```bash
python3 -c "
import json, pathlib

def load(d):
    return json.loads(pathlib.Path(d, 'profile_export_aiperf.json').read_text())

on, off = load('aiperf_router_on'), load('aiperf_router_off')

metrics = [
    ('TTFT avg (ms)',             'time_to_first_token', 'avg'),
    ('TTFT p99 (ms)',             'time_to_first_token', 'p99'),
    ('E2E Latency avg (ms)',      'request_latency',     'avg'),
    ('E2E Latency p99 (ms)',      'request_latency',     'p99'),
    ('Output Throughput (tok/s)', 'output_token_throughput', 'avg'),
]

print(f\"{'Metric':<28} {'Router-OFF':>12} {'Router-ON':>12} {'Speedup':>10}\")
print('-' * 66)
for label, key, stat in metrics:
    v_off = off.get(key, {}).get(stat, 0)
    v_on  = on.get(key, {}).get(stat, 0)
    if 'throughput' in key.lower():
        speedup = v_on / v_off if v_off else 0
    else:
        speedup = v_off / v_on if v_on else 0
    print(f'{label:<28} {v_off:>12.1f} {v_on:>12.1f} {speedup:>9.1f}x')
"
```

---

## Phase 6: Analyze Results

### Key Metrics to Compare

| Metric | Description | What to Look For |
|--------|-------------|------------------|
| **Time to First Token (TTFT)** | Latency until first token arrives | Lower is better; the KV router can reduce it when prefixes are reused |
| **Inter Token Latency (ITL)** | Average time between tokens | Lower is better; indicates generation speed |
| **Request Latency** | Total end-to-end latency | Lower is better; overall user experience |
| **Output Token Throughput** | Tokens generated per second (system-wide) | Higher is better; system efficiency |
| **Request Throughput** | Requests completed per second | Higher is better; capacity |

### Interpreting Results

**Your Results May Vary**: The improvement from KV Smart Router depends heavily on your workload characteristics:

**Factors that increase KV router benefit:**
- **High prefix overlap** (shared system prompts, templates, document contexts)
- **Long prompts** (>2000 tokens) where caching saves significant compute
- **Multi-turn conversations** with context carryover
- **Batch workloads** with similar queries

**Factors that reduce KV router benefit:**
- **Unique prompts** with no prefix reuse
- **Short prompts** (<1000 tokens) where routing overhead exceeds benefit
- **Evenly distributed load** where round-robin is already optimal
- **Low request rate** where cache eviction negates benefits

**KV Smart Router is beneficial when:**
- TTFT improvements > 20%
- No significant degradation in other metrics
- Workload demonstrates measurable prefix reuse patterns

**Standard routing is better when:**
- KV router shows <10% improvement
- Increased latency variance is observed
- Load distribution across workers is more important than cache affinity

### Example Comparison

From our Dynamo Operator benchmark with the full toolagent trace at 0.80× replay speed:

| Metric | Router-OFF (Baseline) | Router-ON (KV Router) | Improvement | Speedup |
|--------|----------------------|----------------------|-------------|---------|
| TTFT avg | 63,652 ms | 2,586 ms | **96% faster** | 24.6x ✅ |
| TTFT p99 | 332,974 ms | 17,871 ms | **95% faster** | 18.6x ✅ |
| E2E Latency avg | 92,856 ms | 19,112 ms | **79% faster** | 4.9x ✅ |
| E2E Latency p99 | 411,252 ms | 88,274 ms | **79% faster** | 4.7x ✅ |

In this example with all 8 workers healthy, the **KV router dramatically outperformed** the baseline:
- **96% faster TTFT** — Users see first token in ~2.6s instead of ~64s
- **79% lower E2E latency** — Requests complete in ~19s instead of ~93s
- **95% faster TTFT p99** — Tail latency drops from ~333s to ~18s
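
The "Improvement" and "Speedup" columns are two views of the same ratio. The short sketch below recomputes both from the latencies in the table; the helper names `speedup` and `percent_faster` are illustrative choices, not AIPerf functions.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times lower the candidate latency is than the baseline."""
    return baseline_ms / candidate_ms

def percent_faster(baseline_ms: float, candidate_ms: float) -> float:
    """Latency reduction relative to the baseline, in percent."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100

# (Router-OFF, Router-ON) latencies in ms, copied from the table above.
results = {
    "TTFT avg": (63_652, 2_586),
    "TTFT p99": (332_974, 17_871),
    "E2E Latency avg": (92_856, 19_112),
    "E2E Latency p99": (411_252, 88_274),
}

for name, (off_ms, on_ms) in results.items():
    print(f"{name:<16} {percent_faster(off_ms, on_ms):.0f}% faster, {speedup(off_ms, on_ms):.1f}x")
```

Note that a 96% reduction corresponds to roughly a 25x speedup, so small changes in the percentage near 100% represent large changes in the ratio.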

The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.

---

## Phase 7: Cleanup

```bash
kubectl delete dynamographdeployment --all -n dynamo-bench
kubectl delete job aiperf-benchmark -n dynamo-bench
kubectl delete namespace dynamo-bench
```

---

## Troubleshooting

### Issue: Pods Stuck in Pending

**Cause:** Insufficient GPU resources

**Solution:**
```bash
# Check GPU availability
kubectl describe nodes | grep -A 10 "Allocated resources"

# Reduce worker replicas if needed
kubectl edit dynamographdeployment <deployment-name> -n dynamo-bench
```

### Issue: ImagePullBackOff Errors

**Cause:** Version mismatch or missing credentials

**Solution:**
```bash
# List the images the pods are trying to pull
kubectl get pods -n dynamo-bench -o yaml | grep image:

# Update deployment YAML to match cluster version
```

### Issue: Operator Not Processing Deployment

**Cause:** Namespace restrictions

**Solution:**
- Ensure Dynamo platform is Helm-installed in the namespace
- Verify operator has `--restrictedNamespace=dynamo-bench` argument
- Check operator logs: `kubectl logs -n dynamo-bench deployment/dynamo-platform-dynamo-operator-controller-manager`

### Issue: Workers Not Becoming Ready

**Cause:** Model download failures or probe configuration

**Solution:**
```bash
# Check worker logs
kubectl logs -n dynamo-bench <worker-pod-name>

# Common issues:
# - Invalid HuggingFace token
# - Network connectivity
# - Insufficient disk space for model
```

### Issue: Workers Restarting in CrashLoopBackOff

**Cause:** Startup probe timeout: workers are killed before they finish initializing

**Symptoms:**
- Pods show "Container main failed startup probe, will be restarted"
- Logs show model still downloading or loading when pod is killed

**Solution:**
The deployment YAMLs in this guide set `failureThreshold: 60`, allowing up to 32 minutes (`120s + 60×30s`). If you lowered this value or are using a larger model that needs more time, increase it:

```bash
kubectl patch dynamographdeployment <deployment-name> -n dynamo-bench --type='json' \
  -p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 80}]'
```

The relevant startup probe fields:
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 60  # 32 minutes total (120s + 60*30s); increase for larger models
```

**Model Loading Times (approximate):**
- Qwen3-32B: ~20-25 minutes (first download)
- With cached model on node: ~2-5 minutes

### Issue: Unequal Worker Health

**Cause:** Resource constraints, image pull issues, or configuration errors

**Solution:**
```bash
# Check all worker status
kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker

# Describe problematic pods
kubectl describe pod <pod-name> -n dynamo-bench

# Fix issues before benchmarking or results will be skewed
```

---

## Advanced Configuration

### Testing Different Models

Replace `Qwen/Qwen3-32B` with your model in:
- Deployment YAML `args` section
- AIPerf `--model` and `--tokenizer` parameters

### Adjusting Worker Count

Change `replicas: 8` in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.

### Using Custom Datasets

Replace the Mooncake trace with your own JSONL file:
- Format: One request per line with `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
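
Before replaying a custom file, a quick structural check can catch malformed lines early. The sketch below validates JSONL lines against the four fields shown in the Mooncake sample entries (`timestamp`, `input_length`, `output_length`, `hash_ids`). Treating all four as required, and requiring non-decreasing timestamps, are assumptions of this example rather than documented AIPerf rules; the function name `validate_trace` is hypothetical.

```python
import json

# Field names taken from the sample trace entries earlier in this guide.
REQUIRED_FIELDS = ("timestamp", "input_length", "output_length", "hash_ids")

def validate_trace(lines):
    """Return a list of human-readable problems found in JSONL trace lines."""
    problems = []
    previous_ts = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
            continue
        if previous_ts is not None and record["timestamp"] < previous_ts:
            problems.append(f"line {number}: timestamp is earlier than the previous line")
        previous_ts = record["timestamp"]
    return problems

sample_lines = [
    '{"timestamp": 0, "input_length": 9013, "output_length": 3, "hash_ids": [46, 47]}',
    '{"timestamp": 5}',
]
print(validate_trace(sample_lines))
```

To check a real file, pass an open file handle, for example `validate_trace(open("my_trace.jsonl"))`, where `my_trace.jsonl` is a placeholder name.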

### Disaggregated Prefill/Decode

For advanced testing, add separate prefill workers:

```yaml
VllmPrefillWorker:
  componentType: worker
  replicas: 2
  # ... configuration
```

---

## Best Practices

1. **Equal Conditions:** Ensure both deployments have identical worker counts and health before benchmarking
2. **Warm-Up:** Run a small test (100 requests) before the full benchmark to warm up caches
3. **Multiple Runs:** Run benchmarks 3+ times and average results for statistical significance
4. **Monitor Workers:** Watch for any pod restarts or issues during benchmark runs
5. **Document Conditions:** Record cluster state, worker health, and any anomalies
6. **Consistent Configuration:** Use the same trace file and AIPerf options for both runs
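
For the multiple-runs practice above, one simple way to combine repeated measurements is the mean together with the sample standard deviation. The sketch below uses Python's `statistics` module; the function name `summarize_runs` is hypothetical and the input values are made-up placeholders, not benchmark results.

```python
import statistics

def summarize_runs(values: list[float]) -> dict:
    """Mean and sample standard deviation of one metric across repeated runs."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        # stdev needs at least two data points; report 0.0 for a single run.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# Placeholder TTFT averages (ms) from three hypothetical runs of one configuration.
ttft_avg_ms = [100.0, 110.0, 90.0]
summary = summarize_runs(ttft_avg_ms)
print(f"n={summary['n']} mean={summary['mean']:.1f} ms stdev={summary['stdev']:.1f} ms")
```

A large standard deviation relative to the difference between router-ON and router-OFF suggests that more runs are needed before drawing conclusions.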

---

## Conclusion

This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics: datasets with high prefix overlap show the most benefit. For further details on tuning the KV router, see the [Tuning Guidelines](../components/router/router-guide.md#tuning-guidelines).

For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.

---

## Appendix: Files Reference

- `router-off-deployment.yaml`: Standard routing deployment
- `router-on-deployment.yaml`: KV router enabled deployment
- `benchmark-job.yaml`: AIPerf benchmark pod
- AIPerf artifact dirs: summary JSON and `profile_export.jsonl` per run

**Repository:** [https://github.com/ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo)