# Qwen3-32B: Aggregated Round Robin vs Disaggregated KV Routing Comparison

This recipe demonstrates the performance difference between **aggregated (round-robin)** and **disaggregated (KV-aware)** routing using a real-world conversation trace dataset from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).

## Results

https://github.com/user-attachments/assets/c425002b-4459-47c4-bfca-fd1e2620500c


## Experiment Overview

We compare two deployment modes on **16x H200 GPUs across 2 nodes**:

| Mode | Routing | Configuration |
|------|---------|---------------|
| **Aggregated** | Round-robin | 8x TP2 workers |
| **Disaggregated** | KV-aware | 6x prefill + 2x decode (TP2) |
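
As a quick sanity check on the table above, both layouts should consume the same 16 GPUs. The sketch below assumes every worker, prefill and decode alike, runs at tensor-parallel size 2; `gpus_needed` is a hypothetical helper, not part of the recipe.

```bash
# Hypothetical helper: GPUs consumed by N workers at a given tensor-parallel size.
gpus_needed() {
  echo $(( $1 * $2 ))
}

echo "aggregated:    $(gpus_needed 8 2) GPUs"           # 8 TP2 workers
echo "disaggregated: $(gpus_needed $((6 + 2)) 2) GPUs"  # 6 prefill + 2 decode, TP2 each
```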

## Dataset: Mooncake Conversation Trace

The benchmark uses a production conversation trace with significant prefix-sharing potential:

| Metric | Value |
|--------|-------|
| Requests | 12,031 over ~59 minutes (3.4 req/s) |
| Input tokens/sec | 40,937 tok/s |
| Input length | avg 12,035 tokens (range: 891 - 126,195) |
| Output length | avg 343 tokens |
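
The rates in the table follow from the request count, the average input length, and the trace duration. A minimal sketch of that arithmetic, assuming "~59 minutes" means exactly 3540 seconds; `trace_rates` is a hypothetical helper name.

```bash
# Derive the request rate and input-token rate from the table values.
# Assumption: "~59 minutes" is treated as exactly 59 * 60 = 3540 seconds.
trace_rates() {
  awk -v reqs="$1" -v avg_in="$2" -v secs="$3" 'BEGIN {
    printf "%.1f req/s, %.0f input tok/s\n", reqs / secs, reqs * avg_in / secs
  }'
}

trace_rates 12031 12035 3540
```

With these rounded inputs the token rate comes out near 40,900 tok/s. The small gap to the reported 40,937 tok/s is consistent with the duration and average length being rounded in the table.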

**Cache Reuse Analysis:**

| Metric | Value | What It Measures |
|--------|-------|------------------|
| Blocks reused | 24.2% | Of 182,790 unique blocks, 44,144 appeared in more than one request |
| Cache efficiency | 36.64% | Of 288,500 total block references, 105,710 were repeats (reusable with infinite cache) |

*Why these differ:* Block reuse counts unique blocks that repeat, ignoring how often they repeat. Cache efficiency weights by frequency—a block reused 12,031 times contributes more than one reused once.

This workload suits KV-aware routing: with 36.64% cache efficiency, requests can be routed to workers that already hold the relevant KV blocks, which cuts redundant prefill work and should lower time to first token (TTFT).
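
To make the two metrics concrete, the sketch below computes both from a toy trace. It assumes a simplified input format (one request per line, block IDs separated by spaces, each ID appearing at most once per request), which is not the actual trace format; `cache_reuse_stats` is a hypothetical helper.

```bash
# Compute both cache-reuse metrics from a trace with one request per line,
# each line listing that request's KV block IDs separated by spaces.
# Assumption: a block ID appears at most once within a single request.
cache_reuse_stats() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        total++        # every block reference counts
        count[$i]++    # references received by each block ID
      }
    }
    END {
      for (b in count) {
        unique++
        if (count[b] > 1) reused++
      }
      printf "blocks reused: %.1f%% (%d of %d unique)\n", 100 * reused / unique, reused, unique
      printf "cache efficiency: %.1f%% (%d of %d references)\n", 100 * (total - unique) / total, total - unique, total
    }
  '
}

# Toy trace (made up for illustration): three requests sharing a prefix.
printf '%s\n' 'a b c' 'a b d' 'a e' | cache_reuse_stats
```

In the toy trace, block `a` is referenced three times, so it adds two repeats to cache efficiency but counts only once toward blocks reused. The same weighting explains why the real trace shows 36.64% against 24.2%.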

## Prerequisites

1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **16x H200 GPUs** across 2 nodes
3. **HuggingFace token** configured:
   ```bash
   export NAMESPACE=your-namespace
   kubectl create secret generic hf-token-secret \
     --from-literal=HF_TOKEN="your-token" \
     -n ${NAMESPACE}
   ```
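
Before moving on, it may help to confirm the prerequisites are in place. The following is a sketch rather than part of the recipe: `check_prereqs` is a hypothetical helper, and the GPU column assumes the NVIDIA device plugin advertises GPUs under the `nvidia.com/gpu` resource name.

```bash
# Hypothetical pre-flight check; relies only on standard kubectl commands.
check_prereqs() {
  # The secret created above must exist in the target namespace.
  kubectl get secret hf-token-secret -n "$1" >/dev/null || {
    echo "hf-token-secret not found in namespace $1" >&2
    return 1
  }
  # Allocatable GPUs per node, assuming the nvidia.com/gpu resource name.
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Usage (needs cluster access):
#   check_prereqs "${NAMESPACE}"
```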

## Quick Start

### 1. Create Storage

> **Note:** Edit `model-cache/cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).

```bash
kubectl apply -f model-cache/cache.yaml -n ${NAMESPACE}
```

### 2. Download Model

```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
```

### 3. Deploy & Benchmark

**Option A: Aggregated (Round-Robin Baseline)**

```bash
# Deploy
kubectl apply -f vllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-8xtp2 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f vllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
```

**Option B: Disaggregated (KV-Aware Routing)**

```bash
# Deploy
kubectl apply -f vllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-router-6p-2d \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f vllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
```

### 4. Monitor Benchmark Progress

The benchmark runs inside a tmux session for easy monitoring:

```bash
# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark

# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark

# Detach from tmux: Ctrl+B, then D
```
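
The `<benchmark-pod-name>` placeholder can also be filled in programmatically instead of being copied by hand. This sketch assumes the benchmark pod carries the `app=benchmark` label that the Cleanup section selects on; `benchmark_pod` is a hypothetical helper.

```bash
# Look up the first pod carrying the app=benchmark label.
# Assumption: the perf.yaml manifests label the pod app=benchmark,
# as the selector used in the Cleanup section implies.
benchmark_pod() {
  kubectl get pods -n "$1" -l app=benchmark \
    -o jsonpath='{.items[0].metadata.name}'
}

# Usage (needs cluster access):
#   POD="$(benchmark_pod "${NAMESPACE}")"
#   kubectl exec -it -n "${NAMESPACE}" "${POD}" -- tmux a -t benchmark
```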

### 5. View Results

Results are saved to the `perf-cache` PVC:

```bash
# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/

# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results
```

## Expected Results

Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**—latency metrics are what we're comparing:

| Metric | Why It Matters |
|--------|----------------|
| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| **Total Request Latency** | Combined benefit of both optimizations |

**Why disaggregated + KV-aware routing helps this workload:**

1. **KV-aware routing** leverages the 36.64% cache efficiency to route requests to workers that already have relevant KV cache blocks, reducing redundant prefill computation and lowering TTFT.

2. **Disaggregated serving** separates prefill and decode workers. With long input sequences (avg 12K tokens) and short outputs (avg 343 tokens), dedicated decode workers avoid "prefill injection"—where a new long-context request interrupts ongoing decode operations, causing ITL spikes.
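
Because ITL spikes show up in the tail rather than the mean, it is worth comparing percentiles between the two runs. This README does not specify the artifact format, so the sketch below assumes per-request latencies have already been extracted to one number per line; `percentile` is a hypothetical helper using the nearest-rank method with an integer percentile.

```bash
# Nearest-rank percentile of newline-separated numbers read from stdin.
# Usage: percentile 99 < ttft_ms.txt   (ttft_ms.txt is a hypothetical file)
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Made-up sample values, purely to show the call pattern.
printf '%s\n' 120 80 400 95 110 | percentile 50
```

Running the same call with `99` on each deployment's latencies gives a like-for-like view of tail behavior.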

## Cleanup

```bash
# Delete benchmark pods
kubectl delete pod -l app=benchmark -n ${NAMESPACE}

# Delete deployments
kubectl delete dynamographdeployment agg-8xtp2 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-router-6p-2d -n ${NAMESPACE}
```

## References

- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data