---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: KV Router A/B Testing
---

This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.

## Overview

Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:

1. Deploy two identical Dynamo configurations:
   a. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITHOUT** KV Smart Router enabled
   b. A vLLM server for Qwen3-32B with 8 workers (aggregated) **WITH** KV Smart Router enabled
2. Run controlled benchmarks using AIPerf
3. Compare performance metrics to evaluate KV router effectiveness

**Prerequisites:** Kubernetes cluster with GPUs, kubectl, helm

---

## Prerequisites

### Required Tools

- `kubectl` (configured with cluster access)
- `helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
  - GPU nodes (H100, H200, or similar)
  - Sufficient GPU capacity (8+ GPUs recommended for this example)
  - Dynamo platform installed globally OR ability to install per-namespace

### Knowledge Requirements

- Basic Kubernetes concepts (namespaces, pods, services)
- Familiarity with LLM inference concepts
- Command-line proficiency

---

## Architecture

This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.

```text
┌──────────────────────────────────────────────┐
│ Namespace: dynamo-bench                       │
│ (one of A or B active at a time)              │
│                                              │
│  Deployment A: Router OFF                     │
│    ├─ Frontend (Standard Routing)              │
│    └─ 8x Decode Workers (1 GPU each)          │
│                                              │
│  Deployment B: Router ON                      │
│    ├─ Frontend (KV Smart Router)               │
│    └─ 8x Decode Workers (1 GPU each)          │
│                                              │
│  Benchmark Pod (AIPerf + Dataset)             │
└──────────────────────────────────────────────┘
```

**Key Difference:** Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.

---

## Phase 1: Namespace and Infrastructure Setup

### Step 1.1: Create Namespace

```bash
kubectl create namespace dynamo-bench
```

### Step 1.2: Create HuggingFace Token Secret (optional)

If the model you are deploying requires a HuggingFace token to download (Llama-family models do), create the secret below, replacing `YOUR_HF_TOKEN` with your actual token:

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
  -n dynamo-bench
```

### Step 1.3: Install Dynamo Platform

Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in `dynamo-bench`.

> **Note:** Namespace-restricted mode (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. Use cluster-wide mode for new deployments.

**Key Configuration Notes:**
- Adjust version tags to match your cluster's available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation

### Step 1.4: Verify Infrastructure

```bash
kubectl get pods -n dynamo-bench
```

Expect the operator, etcd, and NATS pods to be `Running` before deploying the graph.

---

## Phase 2: Deploy Model Serving

### Step 2.1: Create Deployment YAMLs

Create `router-off-deployment.yaml` (baseline):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-no-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-no-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          env:
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-no-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h100-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model Qwen/Qwen3-32B
              --quantization fp8
              --kv-cache-dtype fp8
              --max-model-len 131072
              --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
              --gpu-memory-utilization 0.90
              --block-size 64
              --async-scheduling
              --disable-log-requests
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
      subComponentType: decode
```

Create `router-on-deployment.yaml` (KV router ON):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-router
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-agg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          env:
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # KEY DIFFERENCE: Enable KV Smart Router
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg-router
      componentType: worker
      replicas: 8
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: node.kubernetes.io/instance-type
                      operator: In
                      values:
                        - gpu-h100-sxm  # Adjust to your GPU node type
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
          workingDir: /workspace
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model Qwen/Qwen3-32B
              --quantization fp8
              --kv-cache-dtype fp8
              --max-model-len 131072
              --hf-overrides '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
              --gpu-memory-utilization 0.90
              --block-size 64
              --async-scheduling
              --disable-log-requests
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
          startupProbe:
            httpGet:
              path: /health
              port: 9090
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 60
          livenessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /live
              port: 9090
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10
      subComponentType: decode
```
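
The comparison is only meaningful if the two manifests differ in nothing but the router setting, so it can help to diff them before deploying. The sketch below is one possible check using Python's standard `difflib`; the helper names `manifest_diff` and `diff_files` are illustrative and not part of Dynamo, and the deployment name is normalized first because it is expected to differ.

```python
import difflib
from pathlib import Path


def manifest_diff(off_text: str, on_text: str) -> list[str]:
    """Return the lines that differ between the two manifests.

    The deployment name is normalized first, because it is expected
    to differ (vllm-agg-no-router vs. vllm-agg-router).
    """
    normalized_off = off_text.replace("vllm-agg-no-router", "vllm-agg-router")
    diff = difflib.ndiff(normalized_off.splitlines(), on_text.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are dropped here.
    return [line for line in diff if line.startswith(("- ", "+ "))]


def diff_files(off_path: str, on_path: str) -> list[str]:
    """Convenience wrapper that reads both manifests from disk."""
    return manifest_diff(Path(off_path).read_text(), Path(on_path).read_text())
```

Calling `diff_files("router-off-deployment.yaml", "router-on-deployment.yaml")` on the two files from this step should report only the added `envs` block that sets `DYN_ROUTER_MODE`; anything else points to an unintended difference.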

### Step 2.2: Deploy Router-ON First

```bash
kubectl apply -f router-on-deployment.yaml -n dynamo-bench
```

**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.

First, create the PVC in the same namespace as your deployment (e.g. `dynamo-bench`). Use a storage class that supports ReadWriteMany:

```bash
kubectl get storageclass   # choose one with ReadWriteMany (e.g. azurefile-csi-premium, nfs, efs)
```

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: dynamo-bench
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "azurefile-csi-premium"   # Adjust to your cluster
  resources:
    requests:
      storage: 100Gi
```

Save the manifest as `pvc-model-cache.yaml` and apply it: `kubectl apply -f pvc-model-cache.yaml`

Then reference the existing PVC in your DynamoGraphDeployment by adding the following under `spec` (and under `VllmDecodeWorker`, add `volumeMounts`):

```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
      size: "0"
  services:
    VllmDecodeWorker:
      volumeMounts:
        - mountPoint: /root/.cache/huggingface
          name: model-cache
          useAsCompilationCache: false
```

With this configuration, one worker downloads the model on the first run and the rest load it from the cache. The main benefit comes on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.

### Step 2.3: Monitor Deployment Progress

```bash
kubectl get pods -n dynamo-bench -w
```

Wait for all pods to reach `Running` status and pass readiness probes.

**Expected Timeline:**
- **With shared PVC** (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
- **Without shared PVC**: 20-30 minutes per worker (workers download independently)
  - For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)

The deployment's startup probe (`initialDelaySeconds: 120`, `periodSeconds: 30`, `failureThreshold: 60`) allows up to 32 minutes per pod for model download and initialization.
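
The 32-minute figure follows directly from the probe fields. As a rough sketch (the function name is illustrative, and the result is an approximate upper bound rather than an exact kubelet guarantee), the budget can be computed like this:

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container may take to start before it is restarted."""
    return initial_delay + period * failure_threshold


budget = startup_budget_seconds(initial_delay=120, period=30, failure_threshold=60)
print(f"{budget} s = {budget / 60:.0f} min")  # 1920 s = 32 min
```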

### Step 2.4: Verify Workers Are Healthy

> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health. Unequal worker counts will invalidate your comparison results.

```bash
# Quick health check - should show "8/8"
echo "Workers: $(kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"

# Detailed view
kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker
```

**All 8 must show `1/1 Running` and Ready.** Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).
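
If `jq` is not available, the same readiness count can be done in Python on the output of `kubectl get pods ... -o json`. This is a minimal sketch that mirrors the filter used above (phase `Running` plus a `Ready` condition with status `"True"`); the function names are illustrative.

```python
import json


def count_ready(pod_list: dict) -> int:
    """Count pods that are Running and have a Ready condition set to "True"."""
    ready = 0
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Running":
            continue
        conditions = status.get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
            ready += 1
    return ready


def count_ready_from_json(text: str) -> int:
    """Parse the JSON text printed by kubectl and count the ready pods."""
    return count_ready(json.loads(text))
```

A wrapper script could read the kubectl output from standard input with `sys.stdin.read()`, pass it to `count_ready_from_json`, and exit non-zero when the result is not 8.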

---

## Phase 3: Prepare Benchmark Dataset

### Understanding the Mooncake Toolagent Trace

For this A/B comparison, we use the [**Mooncake FAST'25 Toolagent Trace**](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/toolagent_trace.jsonl), published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake) (USENIX FAST'25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production **tool-agent workloads** — AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains **23,608 requests** spanning ~59 minutes of real-time traffic.

**Why the toolagent trace?** Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router's real-world performance gains.

**What's in the dataset?** Each trace entry contains:
- **Timestamp:** When the request arrived (for realistic request timing)
- **Input/output lengths:** Number of tokens in prompts and responses
- **Block hash IDs:** Cryptographic hashes representing KV cache blocks (no user text; explained below)

**Sample trace entries (showing prefix reuse):**
```json
{"timestamp": 0, "input_length": 9013, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}
{"timestamp": 0, "input_length": 6506, "output_length": 3, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 64]}
```

These two requests share blocks 46–57 (12 blocks × 512 tokens = 6,144 tokens of shared prefix), which is what a tool agent continuing the same session with accumulated context looks like. Each hash ID represents a **512-token block**, and the hash covers both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits.

If you reproduce this benchmark with `python -m dynamo.replay`, keep the trace's block size (a property of the dataset) separate from the replay engine configuration:

- use `--trace-block-size 512` for the Mooncake/toolagent trace itself
- keep engine `block_size` in `--extra-engine-args` aligned with the runtime you want to mimic
  (for the published vLLM deployment, that is typically `64`)
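
To make the prefix arithmetic concrete, the sketch below counts the leading hash IDs shared by the two sample entries and converts that count to tokens using the trace's 512-token block size. The function name is illustrative; this only inspects the trace and is not how the router itself is implemented.

```python
from itertools import takewhile

TRACE_BLOCK_SIZE = 512  # tokens per hash ID in the Mooncake toolagent trace


def shared_prefix_blocks(a: list[int], b: list[int]) -> int:
    """Number of leading hash IDs that two requests have in common."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


first = list(range(46, 64))          # hash_ids of the first sample entry
second = list(range(46, 58)) + [64]  # hash_ids of the second sample entry

blocks = shared_prefix_blocks(first, second)
print(blocks, blocks * TRACE_BLOCK_SIZE)  # 12 6144
```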

**Key Dataset Properties:**
- **Realistic timing:** Request arrival patterns from production tool-agent workloads
- **High prefix overlap:** 59% cache ratio ([Mooncake FAST'25 paper](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/Mooncake-FAST25.pdf)); iterative tool calls within sessions produce natural prefix reuse
- **Privacy-preserving:** No actual text, only hash-based cache block identifiers
- **Reproducible:** Public dataset enables fair comparisons across different systems

### Download and Prepare the Dataset

```bash
# Download the Mooncake FAST'25 toolagent trace
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl -o toolagent_trace.jsonl

# Slow down timestamps to 0.80× replay speed (~5.3 req/s instead of ~6.7 req/s)
python3 - <<'PY'
import json

with open("toolagent_trace.jsonl") as src, open("toolagent_trace_080x.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        rec["timestamp"] = int(rec["timestamp"] / 0.80)
        dst.write(json.dumps(rec) + "\n")
PY

echo "Dataset ready: toolagent_trace_080x.jsonl (23,608 requests, 0.80x speed)"
```
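
The request rates quoted in the script comment come from the trace size and duration: 23,608 requests over roughly 59 minutes is about 6.7 req/s, and 0.80× of that is about 5.3 req/s. A quick sanity check on any trace file might look like the sketch below. It assumes `timestamp` is expressed in milliseconds, and the function name is illustrative.

```python
import json


def average_rate(lines, speed: float = 1.0) -> float:
    """Average requests per second for a trace replayed at `speed`x.

    Assumes each line is a JSON object whose `timestamp` is in milliseconds.
    """
    timestamps = [json.loads(line)["timestamp"] for line in lines if line.strip()]
    duration_s = (max(timestamps) - min(timestamps)) / 1000.0
    if duration_s == 0:
        raise ValueError("trace spans zero time; rate is undefined")
    return len(timestamps) / duration_s * speed


# Example usage on the downloaded file:
#   with open("toolagent_trace.jsonl") as f:
#       print(average_rate(f, speed=0.80))
```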

---

## Phase 4: Set Up Benchmark Environment

### Step 4.1: Deploy Benchmark Pod

Create `benchmark-job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: benchmark
        image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
        securityContext:
          runAsUser: 0  # Required: apt-get and pip install need root in ephemeral benchmark pod
        command:
          - /bin/bash
          - -lc
          - |
            apt-get update -qq && apt-get install -y -qq tmux > /dev/null 2>&1
            pip install -q aiperf==0.5.0
            echo "Benchmark pod ready (tmux + aiperf installed)."
            sleep infinity
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 0
```

This pod installs `tmux` and `aiperf` on startup so benchmarks can run inside a tmux session that survives `kubectl exec` disconnects.

Deploy:

```bash
kubectl apply -f benchmark-job.yaml -n dynamo-bench
```

Wait for the pod to be ready (startup takes ~1-2 minutes while packages install):

```bash
kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -w
```

### Step 4.2: Copy Dataset to Benchmark Pod

```bash
POD_NAME=$(kubectl get pods -n dynamo-bench -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
kubectl -n dynamo-bench cp toolagent_trace_080x.jsonl ${POD_NAME}:/tmp/toolagent_trace_080x.jsonl
```

---

## Phase 5: Run Benchmarks

### Step 5.1: Benchmark Router-ON

Verify the frontend service is reachable (the operator creates a service named `{deployment-name}-frontend`):

```bash
kubectl get svc -n dynamo-bench | grep frontend
```

Launch the benchmark inside a tmux session so it survives `kubectl exec` disconnects:

```bash
kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
  tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
    AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
      -m Qwen/Qwen3-32B \
      --tokenizer Qwen/Qwen3-32B \
      --input-file /tmp/toolagent_trace_080x.jsonl \
      --custom-dataset-type mooncake_trace \
      --fixed-schedule \
      --url http://vllm-agg-router-frontend.dynamo-bench.svc.cluster.local:8000 \
      --streaming \
      --random-seed 42 \
      --workers-max 200 \
      --request-timeout-seconds 1000 \
      --profile-export-level records \
      --record-processors 8 \
      --artifact-dir /tmp/aiperf_router_on \
      --goodput \"time_to_first_token:5000 inter_token_latency:100\""
'
```

AIPerf writes the run to `/tmp/aiperf_router_on` on the pod (summary JSON and `profile_export.jsonl`).

### Monitoring Benchmarks

Benchmarks run inside a **tmux session** so they survive `kubectl exec` disconnects.

Attach to the live TUI (detach with **Ctrl+B then D**):

```bash
kubectl -n dynamo-bench exec -it ${POD_NAME} -- tmux a -t benchmark
```

### Step 5.2: Switch to Router-OFF and Benchmark

Tear down router-ON and deploy the baseline:

```bash
kubectl delete dynamographdeployment vllm-agg-router -n dynamo-bench
kubectl apply -f router-off-deployment.yaml -n dynamo-bench
```

Wait for 8/8 workers to be Ready again (re-run the health check from [Step 2.4](#step-24-verify-workers-are-healthy)), then clean up the previous tmux session and launch the baseline benchmark:

```bash
kubectl -n dynamo-bench exec ${POD_NAME} -- tmux kill-session -t benchmark 2>/dev/null

kubectl -n dynamo-bench exec ${POD_NAME} -- bash -c '
  tmux new-session -d -s benchmark ". /opt/dynamo/venv/bin/activate && \
    AIPERF_HTTP_CONNECTION_LIMIT=200 aiperf profile \
      -m Qwen/Qwen3-32B \
      --tokenizer Qwen/Qwen3-32B \
      --input-file /tmp/toolagent_trace_080x.jsonl \
      --custom-dataset-type mooncake_trace \
      --fixed-schedule \
      --url http://vllm-agg-no-router-frontend.dynamo-bench.svc.cluster.local:8000 \
      --streaming \
      --random-seed 42 \
      --workers-max 200 \
      --request-timeout-seconds 1000 \
      --profile-export-level records \
      --record-processors 8 \
      --artifact-dir /tmp/aiperf_router_off \
      --goodput \"time_to_first_token:5000 inter_token_latency:100\""
'
```

### Step 5.3: Collect Results

Copy the artifact directories (or the summary/export files inside them) to your machine:

```bash
kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_on ./aiperf_router_on
kubectl -n dynamo-bench cp ${POD_NAME}:/tmp/aiperf_router_off ./aiperf_router_off
```

Each artifact directory contains:
- `profile_export_aiperf.json` — summary with aggregated metrics (TTFT, latency percentiles, throughput)
- `profile_export.jsonl` — per-request records (one JSON object per completed request)

### Step 5.4: Quick Comparison

Extract and compare key metrics from the two summary files:

```bash
python3 -c "
import json, pathlib

def load(d):
    return json.loads(pathlib.Path(d, 'profile_export_aiperf.json').read_text())

on, off = load('aiperf_router_on'), load('aiperf_router_off')

metrics = [
    ('TTFT avg (ms)',             'time_to_first_token', 'avg'),
    ('TTFT p99 (ms)',             'time_to_first_token', 'p99'),
    ('E2E Latency avg (ms)',      'request_latency',     'avg'),
    ('E2E Latency p99 (ms)',      'request_latency',     'p99'),
    ('Output Throughput (tok/s)', 'output_token_throughput', 'avg'),
]

print(f\"{'Metric':<28} {'Router-OFF':>12} {'Router-ON':>12} {'Speedup':>10}\")
print('-' * 66)
for label, key, stat in metrics:
    v_off = off.get(key, {}).get(stat, 0)
    v_on  = on.get(key, {}).get(stat, 0)
    if 'throughput' in key.lower():
        speedup = v_on / v_off if v_off else 0
    else:
        speedup = v_off / v_on if v_on else 0
    print(f'{label:<28} {v_off:>12.1f} {v_on:>12.1f} {speedup:>9.1f}x')
"
```

---

## Phase 6: Analyze Results

### Key Metrics to Compare

| Metric | Description | What to Look For |
|--------|-------------|------------------|
| **Time to First Token (TTFT)** | Latency until first token arrives | Lower is better; KV router may reduce it when prefixes are reused |
| **Inter Token Latency (ITL)** | Average time between tokens | Lower is better; indicates generation speed |
| **Request Latency** | Total end-to-end latency | Lower is better; overall user experience |
| **Output Token Throughput** | Tokens generated per second (system-wide) | Higher is better; system efficiency |
| **Request Throughput** | Requests completed per second | Higher is better; capacity |

### Interpreting Results

**Your Results May Vary**: The improvement from KV Smart Router depends heavily on your workload characteristics:

**Factors that increase KV router benefit:**
- **High prefix overlap** (shared system prompts, templates, document contexts)
- **Long prompts** (>2000 tokens) where caching saves significant compute
- **Multi-turn conversations** with context carryover
- **Batch workloads** with similar queries

**Factors that reduce KV router benefit:**
- **Unique prompts** with no prefix reuse
- **Short prompts** (less than 1000 tokens) where routing overhead exceeds benefit
- **Evenly distributed load** where round-robin is already optimal
- **Low request rate** where cache eviction negates benefits

**KV Smart Router is beneficial when:**
- TTFT improvements > 20%
- No significant degradation in other metrics
- Workload demonstrates measurable prefix reuse patterns

**Standard routing is better when:**
- KV router shows less than 10% improvement
- Increased latency variance is observed
- Load distribution across workers is more important than cache affinity
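
The TTFT thresholds above can be turned into a small helper for scripted comparisons. This is only a sketch: the function name and return labels are hypothetical, the 10-20% band is treated as inconclusive because the guide does not define it, and the other criteria (no degradation elsewhere, latency variance, load distribution) still need to be checked separately.

```python
def router_verdict(ttft_off_ms: float, ttft_on_ms: float) -> str:
    """Classify a TTFT comparison using the 20% and 10% thresholds above."""
    improvement = (ttft_off_ms - ttft_on_ms) / ttft_off_ms * 100.0
    if improvement > 20.0:
        return "kv-router"
    if improvement < 10.0:
        return "standard"
    # Assumption: the guide does not say what to do between 10% and 20%.
    return "inconclusive"
```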

### Example Comparison

From our Dynamo Operator benchmark with the full toolagent trace at 0.80× replay speed:

| Metric | Router-OFF (Baseline) | Router-ON (KV Router) | Improvement | Speedup |
|--------|----------------------|----------------------|-------------|---------|
| TTFT avg | 63,652 ms | 2,586 ms | **96% lower** | 24.6x ✅ |
| TTFT p99 | 332,974 ms | 17,871 ms | **95% lower** | 18.6x ✅ |
| E2E Latency avg | 92,856 ms | 19,112 ms | **79% lower** | 4.9x ✅ |
| E2E Latency p99 | 411,252 ms | 88,274 ms | **79% lower** | 4.7x ✅ |

In this example with all 8 workers healthy, the **KV router dramatically outperformed** the baseline:
- **96% lower average TTFT:** users see the first token in ~2.6s instead of ~64s
- **79% lower average E2E latency:** requests complete in ~19s instead of ~93s
- **95% lower TTFT p99:** tail latency drops from ~333s to ~18s
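
The percentage and speedup columns are two views of the same ratio. The sketch below shows how both are derived from a pair of latencies, using the TTFT avg row from the table above; the function names are illustrative.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times lower the candidate latency is than the baseline."""
    return baseline_ms / candidate_ms


def percent_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Latency reduction relative to the baseline, in percent."""
    return (1.0 - candidate_ms / baseline_ms) * 100.0


# TTFT avg: Router-OFF 63,652 ms, Router-ON 2,586 ms
print(f"{speedup(63652, 2586):.1f}x, {percent_reduction(63652, 2586):.0f}% lower")
# 24.6x, 96% lower
```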

The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.

---

## Phase 7: Cleanup

```bash
kubectl delete dynamographdeployment --all -n dynamo-bench
kubectl delete job aiperf-benchmark -n dynamo-bench
kubectl delete namespace dynamo-bench
```

---

## Troubleshooting

### Issue: Pods Stuck in Pending

**Cause:** Insufficient GPU resources

**Solution:**
```bash
# Check GPU availability
kubectl describe nodes | grep -A 10 "Allocated resources"

# Reduce worker replicas if needed
kubectl edit dynamographdeployment -n dynamo-bench
```

### Issue: ImagePullBackOff Errors

**Cause:** Version mismatch or missing credentials

**Solution:**
```bash
# List the images your pods currently reference
kubectl get pods -n dynamo-bench -o yaml | grep image:

# Update deployment YAML to match cluster version
```

### Issue: Operator Not Processing Deployment

**Cause:** Namespace restrictions

**Solution:**
- Ensure Dynamo platform is Helm-installed in the namespace
- Verify operator has `--restrictedNamespace=dynamo-bench` argument
- Check operator logs: `kubectl logs -n dynamo-bench deployment/dynamo-platform-dynamo-operator-controller-manager`

### Issue: Workers Not Becoming Ready

**Cause:** Model download failures or probe configuration

**Solution:**
```bash
# Check worker logs
kubectl logs -n dynamo-bench <worker-pod-name>

# Common issues:
# - Invalid HuggingFace token
# - Network connectivity
# - Insufficient disk space for model
```

### Issue: Workers Restarting in CrashLoopBackOff

**Cause:** Startup probe timeout — workers killed before finishing initialization

**Symptoms:**
- Pods show "Container main failed startup probe, will be restarted"
- Logs show model still downloading or loading when pod is killed

**Solution:**
The deployment YAMLs in this guide set `failureThreshold: 60`, allowing up to 32 minutes (`120s + 60×30s`). If you lowered this value or are using a larger model that needs more time, increase it:

```bash
kubectl patch dynamographdeployment <deployment-name> -n dynamo-bench --type='json' \
  -p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 80}]'
```

The relevant startup probe fields:
```yaml
startupProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 60  # 32 minutes total (120s + 60*30s); increase for larger models
```

**Model Loading Times (approximate):**
- Qwen3-32B: ~20-25 minutes (first download)
- With cached model on node: ~2-5 minutes
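
When a model needs longer than the default budget, the required `failureThreshold` can be estimated from the expected load time. The sketch below assumes the same `initialDelaySeconds` and `periodSeconds` as the YAMLs in this guide; the function name is illustrative, and it returns the bare minimum, so add headroom in practice.

```python
import math


def min_failure_threshold(load_minutes: float, initial_delay: int = 120, period: int = 30) -> int:
    """Smallest failureThreshold whose probe budget covers `load_minutes`."""
    remaining_seconds = load_minutes * 60 - initial_delay
    return max(1, math.ceil(remaining_seconds / period))


print(min_failure_threshold(25))  # 46
print(min_failure_threshold(40))  # 76
```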

### Issue: Unequal Worker Health

**Cause:** Resource constraints, image pull issues, or configuration errors

**Solution:**
```bash
# Check all worker status
kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker

# Describe problematic pods
kubectl describe pod <pod-name> -n dynamo-bench

# Fix issues before benchmarking or results will be skewed
```

---

## Advanced Configuration

### Testing Different Models

Replace `Qwen/Qwen3-32B` with your model in:
- Deployment YAML `args` section
- AIPerf `--model` and `--tokenizer` parameters

### Adjusting Worker Count

Change `replicas: 8` in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.

### Using Custom Datasets

Replace the Mooncake trace with your own JSONL file:
- Format: One request per line with `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
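
Before pointing AIPerf at a custom trace, a quick structural check can catch malformed lines early. The sketch below assumes the file mirrors the four fields shown in the Mooncake sample entries (`timestamp`, `input_length`, `output_length`, `hash_ids`); whether AIPerf requires all of them for a given `--custom-dataset-type` is an assumption to verify against its documentation, and the function name is illustrative.

```python
import json

# Assumption: field names copied from the Mooncake sample entries in this guide.
REQUIRED_FIELDS = ("timestamp", "input_length", "output_length", "hash_ids")


def validate_trace(lines) -> list[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    previous_ts = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
            continue
        if previous_ts is not None and record["timestamp"] < previous_ts:
            problems.append(f"line {number}: timestamp goes backwards")
        previous_ts = record["timestamp"]
    return problems
```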

### Disaggregated Prefill/Decode

For advanced testing, add separate prefill workers:

```yaml
VllmPrefillWorker:
  componentType: worker
  replicas: 2
  # ... configuration
```

---

## Best Practices

1. **Equal Conditions:** Ensure both deployments have identical worker counts and health before benchmarking
2. **Warm-Up:** Run a small test (100 requests) before the full benchmark to warm up caches
3. **Multiple Runs:** Run benchmarks 3+ times and average results for statistical significance
4. **Monitor Workers:** Watch for any pod restarts or issues during benchmark runs
5. **Document Conditions:** Record cluster state, worker health, and any anomalies
6. **Consistent Configuration:** Use the same trace file and AIPerf options for both runs
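
For the warm-up run in item 2, one option is to replay only the first 100 records of the prepared trace. The helper below is a minimal sketch; the function name and the output file name in the usage comment are hypothetical.

```python
from itertools import islice


def warmup_slice(src_path: str, dst_path: str, n: int = 100) -> int:
    """Copy the first `n` trace records to a separate warm-up file."""
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written


# Example usage:
#   warmup_slice("toolagent_trace_080x.jsonl", "toolagent_trace_warmup.jsonl")
```

Use the same warm-up file for both deployments so that the two runs start from comparable conditions.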

---

## Conclusion

This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit. For further details on tuning the KV router, see the [Tuning Guidelines](../components/router/router-guide.md#tuning-guidelines).

For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.

---

## Appendix: Files Reference

- `router-off-deployment.yaml`: Standard routing deployment
- `router-on-deployment.yaml`: KV router enabled deployment
- `benchmark-job.yaml`: AIPerf benchmark pod
- AIPerf artifact dirs: summary JSON and `profile_export.jsonl` per run

**Repository:** [https://github.com/ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo)