This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview
Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
1. Deploy two identical Dynamo configurations:
...
@@ -28,7 +29,7 @@ Dynamo's KV Smart Router intelligently routes requests based on KV cache affinit
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
- GPU nodes (H100, H200, or similar)
- Sufficient GPU capacity (16+ GPUs recommended for this example)
- Sufficient GPU capacity (8+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace
### Knowledge Requirements
...
@@ -41,28 +42,23 @@ Dynamo's KV Smart Router intelligently routes requests based on KV cache affinit
## Architecture
This guide sets up two parallel deployments, as well as a benchmarking pod that can test each deployment:
This guide uses a single namespace. We deploy one configuration (e.g. router-ON), run the benchmark, tear it down, then deploy the other (router-OFF) and run the same benchmark.
```text
┌─────────────────────────────────────┐
│  Deployment A: Router OFF           │
│  Namespace: router-off-test         │
│  ├─ Frontend (Standard Routing)     │
│  └─ 8x Decode Workers (1 GPU each)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Deployment B: Router ON            │
│  Namespace: router-on-test          │
│  ├─ Frontend (KV Smart Router)      │
│  └─ 8x Decode Workers (1 GPU each)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Benchmark Pod                      │
│  Namespace: benchmark               │
│  └─ AIPerf + Dataset                │
└─────────────────────────────────────┘
```
```text
┌──────────────────────────────────────────────┐
│  Namespace: dynamo-bench                     │
│  (one of A or B active at a time)            │
│                                              │
│  Deployment A: Router OFF                    │
│  ├─ Frontend (Standard Routing)              │
│  └─ 8x Decode Workers (1 GPU each)           │
│                                              │
│  Deployment B: Router ON                     │
│  ├─ Frontend (KV Smart Router)               │
│  └─ 8x Decode Workers (1 GPU each)           │
│                                              │
│  Benchmark Pod (AIPerf + Dataset)            │
└──────────────────────────────────────────────┘
```
**Key Difference:** Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.
...
@@ -71,13 +67,10 @@ This guide sets up two parallel deployments, as well as a benchmarking pod that
If the model you want to deploy requires a HuggingFace token to download (Llama-family models do), replace `YOUR_HF_TOKEN` with your actual HuggingFace token:
```bash
# Router-OFF namespace
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
  -n router-off-test
# Router-ON namespace
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
  -n router-on-test
```
```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
  -n dynamo-bench
```
### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in both namespaces:
- `router-off-test`
- `router-on-test`
### Step 1.3: Install Dynamo Platform
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in the workload namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in `dynamo-bench`.
**Key Configuration Notes:**
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
...
@@ -110,20 +94,11 @@ If your cluster uses namespace-restricted Dynamo operators, you'll need to insta
### Step 1.4: Verify Infrastructure
Wait for operators and infrastructure to be ready:
```bash
# Check router-off-test
kubectl get pods -n router-off-test
# Check router-on-test
kubectl get pods -n router-on-test
```
```bash
kubectl get pods -n dynamo-bench
```
You should see:
Expect the operator, etcd, and NATS pods to be `Running` before deploying the graph.
**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.
First, create the PVC separately:
First, create the PVC in the same namespace as your deployment (e.g. `dynamo-bench`). Use a storage class that supports ReadWriteMany:
```bash
kubectl get storageclass  # choose one with ReadWriteMany (e.g. azurefile-csi-premium, nfs, efs)
```
```yaml
...
  storageClassName: "azurefile-csi-premium"  # Adjust to your cluster
  resources:
    requests:
      storage: 100Gi
```
Then reference it in your DynamoGraphDeployment:
Apply it: `kubectl apply -f pvc-model-cache.yaml`
Then reference the existing PVC in your DynamoGraphDeployment by adding the following under `spec` (and under `VllmDecodeWorker`, add `volumeMounts`):
```yaml
spec:
...
@@ -314,16 +338,12 @@ spec:
useAsCompilationCache: false
```
With this configuration, only the first worker downloads the model; others use the cached version, reducing startup time from 20+ minutes to ~2 minutes per pod.
With this configuration, the first run has one worker download; the rest load from cache. The main benefit is on redeploy: the model stays on the PVC, so new pods load from cache and come up in ~5–10 minutes instead of downloading again.
### Step 2.3: Monitor Deployment Progress
```bash
# Watch router-OFF pods
kubectl get pods -n router-off-test -w
# Watch router-ON pods
kubectl get pods -n router-on-test -w
```
```bash
kubectl get pods -n dynamo-bench -w
```
Wait for all pods to reach `Running` status and pass readiness probes.
...
@@ -333,113 +353,69 @@ Wait for all pods to reach `Running` status and pass readiness probes.
- **Without shared PVC**: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)
The startup probe allows 32 minutes per pod (failureThreshold: 60), which accommodates model download and initialization.
The deployment's startup probe (`initialDelaySeconds: 120`, `periodSeconds: 30`, `failureThreshold: 60`) allows up to 32 minutes per pod for model download and initialization.
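The 32-minute figure follows from the probe settings: roughly `initialDelaySeconds` plus `failureThreshold` probe attempts spaced `periodSeconds` apart. A minimal sketch of that arithmetic, which may help when sizing the budget for a larger model (the helper name `startup_budget_seconds` is hypothetical and not part of Dynamo or kubectl):

```bash
# Hypothetical helper: approximate worst-case time (seconds) a pod gets to pass its startup probe.
# Usage: startup_budget_seconds <initialDelaySeconds> <periodSeconds> <failureThreshold>
startup_budget_seconds() {
  local initial_delay=$1
  local period=$2
  local failure_threshold=$3
  echo $(( initial_delay + period * failure_threshold ))
}

# Probe values quoted in this guide.
budget=$(startup_budget_seconds 120 30 60)
echo "Startup budget: ${budget}s ($(( budget / 60 )) minutes)"
```

With the guide's values this comes to 1920 seconds, i.e. 32 minutes. Raising `failureThreshold` by 10 adds five minutes at a 30-second period.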
### Step 2.4: Verify All Workers Are Healthy
### Step 2.4: Verify Workers Are Healthy
> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health in both deployments. Unequal worker counts will invalidate your comparison results.
> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health. Unequal worker counts will invalidate your comparison results.
```bash
echo "Router ON: $(kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
# Detailed view
kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker
kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker
```
```bash
# Detailed view
kubectl get pods -n dynamo-bench -l nvidia.com/dynamo-component-type=worker
```
**Both must show 8/8 workers in Ready state (1/1 Running).** If workers are not ready:
- Common issues: model download in progress, startup probe timeout, insufficient GPU resources
**Do not proceed with benchmarks until all 16 workers (8 per deployment) are healthy.**
**All 8 must show `1/1 Running` and Ready.** Do not proceed until this is confirmed. Repeat this check after you tear down router-ON and deploy router-OFF (Phase 5).
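To make this checkpoint hard to skip, the readiness count can be turned into a gate that fails when the numbers do not match. The sketch below is one possible way to do it, not part of the guide's tooling: the function names `require_workers` and `count_ready_workers` are hypothetical, and the `jq` filter simply reuses the one shown in the health check.

```bash
# Hypothetical gate: succeed only when the ready count matches the expected worker count.
# Usage: require_workers <ready_count> <expected_count>
require_workers() {
  local ready=$1
  local expected=$2
  if [ "$ready" -eq "$expected" ]; then
    echo "OK: ${ready}/${expected} workers ready"
  else
    echo "NOT READY: ${ready}/${expected} workers ready" >&2
    return 1
  fi
}

# Count Ready worker pods in a namespace (requires kubectl and jq).
# Usage: count_ready_workers <namespace>
count_ready_workers() {
  kubectl get pods -n "$1" -l nvidia.com/dynamo-component-type=worker \
    --field-selector=status.phase=Running -o json |
    jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length'
}

# Example (not executed here):
#   require_workers "$(count_ready_workers dynamo-bench)" 8 || exit 1
```

Placing the example line at the top of a benchmark launch script stops the run before any traffic is sent if fewer than 8 workers are Ready.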
---
## Phase 3: Prepare Benchmark Dataset
### Understanding the Mooncake Trace Dataset
### Understanding the Mooncake Toolagent Trace
For this A/B comparison, we use the **Mooncake Trace Dataset**, published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake). This is a privacy-preserving dataset of real-world LLM inference traffic from production arxiv workloads.
For this A/B comparison, we use the [**Mooncake FAST'25 Toolagent Trace**](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/toolagent_trace.jsonl), published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake) (USENIX FAST'25 Best Paper). This is a privacy-preserving dataset of real-world LLM inference traffic from production **tool-agent workloads**: AI agents that iteratively call tools and APIs while maintaining a growing conversation context. The trace contains **23,608 requests** spanning ~59 minutes of real-time traffic.
**Why the toolagent trace?** Tool-agent workloads are ideal for evaluating KV cache routing because each agent session involves repeated LLM calls that share a long, growing prefix (system prompt + conversation history + tool results), producing high natural prefix overlap between requests. The Mooncake toolagent trace captures these realistic patterns, letting us demonstrate the router's real-world performance gains.
**What's in the dataset?** Each trace entry contains:
- **Timestamp:** When the request arrived (for realistic request timing)
- **Input/output lengths:** Number of tokens in prompts and responses
### Why Mooncake Traces Matter for KV Cache Benchmarking
These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits.
**The Challenge:** Traditional LLM benchmarks use synthetic or random data, which are often insufficient to capture real-world optimizations like KV Smart Router. To properly evaluate this feature, we need realistic traffic patterns with **prefix repetition** - but this creates a privacy problem: how do we measure realistic KV cache hit patterns without exposing actual user conversations?
Instead of storing actual prompt text, the Mooncake dataset uses cryptographic hashes to represent KV cache blocks. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks. This preserves the **pattern of prefix reuse** while completely protecting user privacy.
### How it works - Multi-turn conversation example
```text
Turn 1 (initial request - long document analysis):
Input: ~8,000 tokens (e.g., research paper + question)
└──────────── Reuses first 16 blocks (~8,192 tokens) ───────────────┘
```
When requests share the same hash IDs (e.g., blocks 46-61), it means they share those 512-token blocks - indicating **significant prefix overlap** (in this case, 8,192 tokens). The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits and avoiding redundant computation for those shared prefix tokens.
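The block arithmetic used here (number of shared leading hash IDs × 512 tokens) can be sketched as a small helper. The hash ID lists below are illustrative values chosen to match the "blocks 46-57" case, not entries copied from the trace, and `shared_prefix_blocks` is a hypothetical name:

```bash
BLOCK_TOKENS=512  # each hash ID covers one 512-token block

# Count the leading hash IDs that two requests have in common.
# Usage: shared_prefix_blocks "<space-separated IDs of A>" "<space-separated IDs of B>"
shared_prefix_blocks() {
  local ids_a=$1
  local n=0
  local id
  set -- $2                     # positional parameters now hold request B's IDs
  for id in $ids_a; do
    # Stop at the first position where B runs out or the IDs differ.
    if [ "$#" -eq 0 ] || [ "$id" != "$1" ]; then
      break
    fi
    n=$(( n + 1 ))
    shift
  done
  echo "$n"
}

# Illustrative requests: both begin with blocks 46-57, then diverge.
req_a="46 47 48 49 50 51 52 53 54 55 56 57 900 901"
req_b="46 47 48 49 50 51 52 53 54 55 56 57 950"

blocks=$(shared_prefix_blocks "$req_a" "$req_b")
echo "Shared prefix: ${blocks} blocks, $(( blocks * BLOCK_TOKENS )) tokens"
```

For these two lists the helper reports 12 shared blocks, or 6,144 tokens; a 16-block overlap (blocks 46-61) would give 8,192 tokens by the same calculation.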
**Key Dataset Properties:**
- ✅ **Realistic timing:** Request arrival patterns from production workloads
- ✅ **Realistic timing:** Request arrival patterns from production tool-agent workloads
- ✅ **Real prefix patterns:** Up to 50% cache hit ratio ([Mooncake technical report](https://github.com/kvcache-ai/Mooncake))
- ✅ **High prefix overlap:** 59% cache ratio ([Mooncake FAST'25 paper](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/Mooncake-FAST25.pdf)); iterative tool calls within sessions produce natural prefix reuse
- ✅ **Privacy-preserving:** No actual text - only hash-based cache block identifiers
- ✅ **Privacy-preserving:** No actual text — only hash-based cache block identifiers
- ✅ **Reproducible:** Public dataset enables fair comparisons across different systems
**Why this matters:** With random synthetic data, the KV Smart Router would show no benefit because there's no prefix reuse to exploit. Mooncake traces provide realistic workload patterns that demonstrate the router's real-world performance gains while respecting user privacy.
Wait for 8/8 workers to be Ready again (re-run the health check from [Step 2.4](#step-24-verify-workers-are-healthy)), then clean up the previous tmux session and launch the baseline benchmark:
- **95% faster TTFT p99**: tail latency drops from ~333s to ~18s
Request Throughput: 9.33 req/sec (8% higher ✅)
```
In this example with all 8 workers healthy, the **KV router significantly outperformed** the baseline:
The toolagent trace has heavy prefix overlap from tool-agent sessions with repeated context. Without the KV router, requests with overlapping prefixes are scattered across workers, causing redundant recomputation and unbounded queue growth at high utilization. With the KV router, matching prefixes are routed to the same worker, maximizing cache hits and keeping latencies stable under load.
- **37% faster TTFT** - Users see first token much sooner
- **8% higher throughput** - System processes more requests per second
The Mooncake arxiv dataset has sufficient prefix overlap (long input sequences with similar patterns) to benefit from KV cache-aware routing. Workloads with explicit shared prefixes (system prompts, templates) may see even greater improvements.
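The "faster" percentages in these results are relative reductions against the router-OFF baseline. As a rough sketch of how such a figure is derived from two measurements (the helper name `percent_reduction` is hypothetical; the inputs are the approximate p99 TTFT values quoted above):

```bash
# Percent reduction from a baseline value to an improved value, rounded to a whole percent.
# Usage: percent_reduction <baseline> <improved>
percent_reduction() {
  awk -v baseline="$1" -v improved="$2" \
    'BEGIN { printf "%.0f\n", (baseline - improved) / baseline * 100 }'
}

# Approximate p99 TTFT: ~333s without the KV router, ~18s with it.
echo "TTFT p99 reduction: $(percent_reduction 333 18)%"
```

Running the same helper on your own AIPerf numbers gives directly comparable figures for other metrics where lower is better.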
**Cause:** Startup probe timeout - workers killed before finishing initialization
**Cause:** Startup probe timeout — workers killed before finishing initialization
**Symptoms:**
- Pods show "Container main failed startup probe, will be restarted"
- Logs show model still downloading or loading when pod is killed
- Large models (>30GB) take longer than default 22-minute timeout
**Solution:**
Increase the startup probe `failureThreshold`:
The deployment YAMLs in this guide set `failureThreshold: 60`, allowing up to 32 minutes (`120s + 60×30s`). If you lowered this value or are using a larger model that needs more time, increase it:
```bash
# Patch the deployment to allow 32 minutes instead of 22