# DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP

This **GB200 NVL72** recipe for DeepSeek V3.2 demonstrates the performance difference between **aggregated (round-robin) routing** and **disaggregated (KV-aware) routing + WideEP** on a synthetic trace dataset adapted from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).

## Results

https://github.com/user-attachments/assets/fcdb703c-7c1a-4109-a7ca-54196fcef885

## Experiment Overview

We compare two deployment modes on **32x GB200 GPUs across 8 nodes**:

| Mode | Routing | Configuration |
|------|---------|---------------|
| **Aggregated** | Round-robin | 4x DEP8 workers |
| **Disaggregated** | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |
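
To make the routing difference concrete, the toy sketch below contrasts the two worker-selection policies. It is not Dynamo's router: the `Worker` class, the block-hash lists, and the scoring rule (longest cached prefix, with ties going to the worker holding the fewest cached blocks as a crude load proxy) are simplifying assumptions for illustration only.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Worker:
    """Hypothetical worker that remembers which KV blocks it has cached."""
    name: str
    cached_blocks: set = field(default_factory=set)


def prefix_overlap(worker, block_hashes):
    """Count the leading blocks of a request already cached on this worker."""
    matched = 0
    for h in block_hashes:
        if h not in worker.cached_blocks:
            break
        matched += 1
    return matched


def route_round_robin(workers, counter):
    """Pick workers in a fixed rotation, ignoring cache contents."""
    return workers[next(counter) % len(workers)]


def route_kv_aware(workers, block_hashes):
    """Pick the worker with the longest cached prefix.

    Ties go to the worker with the fewest cached blocks (assumed load proxy).
    """
    return max(
        workers,
        key=lambda w: (prefix_overlap(w, block_hashes), -len(w.cached_blocks)),
    )


def simulate(policy, requests, num_workers=4):
    """Return the fraction of request blocks served from a worker's cache."""
    workers = [Worker(f"w{i}") for i in range(num_workers)]
    counter = count()
    hits = total = 0
    for blocks in requests:
        if policy == "round_robin":
            worker = route_round_robin(workers, counter)
        else:
            worker = route_kv_aware(workers, blocks)
        hits += prefix_overlap(worker, blocks)
        total += len(blocks)
        worker.cached_blocks.update(blocks)
    return hits / total


# Two made-up conversations, each growing by one block per turn.
requests = [[0, 1], [10, 11], [0, 1, 2], [10, 11, 12], [0, 1, 2, 3]]
print(f"round-robin hit rate: {simulate('round_robin', requests):.1%}")
print(f"KV-aware hit rate:    {simulate('kv_aware', requests):.1%}")
```

In this toy run, round-robin scatters each conversation's turns across workers, so most prefixes are recomputed, while the prefix-aware policy keeps each conversation on the worker that already holds its blocks.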

## Dataset: Mooncake-based Synthetic Coding Trace

The benchmark uses a trace that simulates coding workloads. We synthesize it by increasing the input sequence length and prefix reuse rate of the original [Mooncake conversation trace](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/conversation_trace.jsonl).

To reproduce our benchmark, run Dynamo's [prefix data generator tool](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator) on the Mooncake `conversation_trace.jsonl`:
```bash
datagen synthesize \
    --input-file conversation_trace.jsonl \
    --prefix-len-multiplier 16 \
    --prompt-len-multiplier 10 \
    --max-isl 110000 \
    --num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`
```

The ISL, OSL, and cache-hit statistics of our trace are shown below.

<details>
<summary>Dataset statistics: Mooncake-based Synthetic Trace</summary>

```
============================================================
  DATASET ANALYSIS: Mooncake-based Synthetic Trace
  ============================================================
  OVERVIEW
  ----------------------------------------
    Total Requests:      10,000
    Unique Hash Blocks:  430,838
    Total Hash Blocks:   770,934
  INPUT SEQUENCE LENGTH (ISL)
  ----------------------------------------
    Average:             39,186 tokens
    Maximum:             109,459 tokens
    Minimum:             12,801 tokens
  OUTPUT SEQUENCE LENGTH (OSL)
  ----------------------------------------
    Average:             344 tokens
    Maximum:             2,000 tokens
    Minimum:             1 tokens
  KV CACHE / PREFIX REUSE
  ----------------------------------------
    Block-level Hit Rate: 44.1%
    Token-level Hit Rate: 44.0%
    Avg Context (shared): 22,400 tokens/req
    Avg Unique Prompt:    16,786 tokens/req
    Shared Prefix Ratio:  57.2%
  ============================================================

  Summary:
  • ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
  • ~57% of input tokens come from shared context prefixes
  • Long-context workload: avg 39K input tokens, up to 109K max
```

</details>
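
The headline ratios in these statistics follow from simple counts. The sketch below recomputes them under two assumptions that agree with the reported numbers but are not confirmed by the generator's source: the block-level hit rate is the fraction of hash blocks that repeat an earlier block (an unbounded cache), and the shared prefix ratio is the average shared context divided by the average ISL. The `hash_ids` field follows the Mooncake trace format; `block_hit_rate` is a hypothetical helper.

```python
import json


def block_hit_rate(trace_lines):
    """Fraction of hash blocks already seen earlier in the trace.

    Assumes an unbounded cache: the first occurrence of a block is a miss
    and every later occurrence is a hit.
    """
    seen = set()
    total = 0
    for line in trace_lines:
        hash_ids = json.loads(line)["hash_ids"]
        total += len(hash_ids)
        seen.update(hash_ids)
    return (total - len(seen)) / total


# Cross-check against the totals reported in the dataset statistics.
unique_blocks, total_blocks = 430_838, 770_934
reported_hit_rate = 1 - unique_blocks / total_blocks
print(f"block-level hit rate: {reported_hit_rate:.1%}")

avg_shared, avg_unique = 22_400, 16_786
print(f"average ISL: {avg_shared + avg_unique:,} tokens")
print(f"shared prefix ratio: {avg_shared / (avg_shared + avg_unique):.1%}")

# Tiny made-up trace: 6 blocks in total, 4 unique, so 2 hits.
toy_trace = [
    '{"hash_ids": [0, 1]}',
    '{"hash_ids": [0, 1, 2]}',
    '{"hash_ids": [3]}',
]
print(f"toy hit rate: {block_hit_rate(toy_trace):.1%}")
```

The cross-check reproduces the reported 44.1% block-level hit rate, the 39,186-token average ISL, and the 57.2% shared prefix ratio.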


## Prerequisites

1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **32x GB200 GPUs** across 8 nodes
3. **HuggingFace token** configured:
   ```bash
   export NAMESPACE=your-namespace
   kubectl create secret generic hf-token-secret \
     --from-literal=HF_TOKEN="your-token" \
     -n ${NAMESPACE}
   ```

## Quick Start

### 1. Create Storage

> **Note:** Edit `model-cache/model-cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).

```bash
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
```

### 2. Configure K8s Benchmarking Environment

For multinode Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. Without it, internode GPU peer memory access would fail.
```bash
kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}
```
If you change any names in this file, make the same changes in the deployment YAML files under `extraPodSpec.resourceClaims` and `mainContainer.resources.claims`.


### 3. Setup Model and Data
We use NVIDIA's official NVFP4-quantized checkpoint ([Hugging Face](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4)). Copy it into the PVC storage:

```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
```
Similarly, copy the trace file for the benchmark into the PVC:
```bash
# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
```

### 4. Deploy & Benchmark

**Option A: Aggregated (Round-Robin Baseline)**

```bash
# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
```

**Option B: Disaggregated (KV-Aware Routing)**

```bash
# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
```

### 5. Monitor Benchmark Progress

The benchmark runs inside a tmux session for easy monitoring:

```bash
# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark

# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark

# Detach from tmux: Ctrl+B, then D
```

### 6. View Results

Results are saved to the `perf-cache` PVC:

```bash
# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/

# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results
```

## Expected Results

Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**, so latency metrics are what we compare:

| Metric | Why It Matters |
|--------|----------------|
| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| **Total Request Latency** | Combined benefit of both optimizations |
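
Under a fixed schedule, the offered load is a property of the trace alone, which is why throughput does not distinguish the two deployments. The sketch below is a minimal illustration, assuming Mooncake-style records with a `timestamp` in milliseconds plus `input_length` and `output_length` fields. The `offered_load` helper and the toy records are hypothetical and not part of the benchmark tooling.

```python
import json


def offered_load(trace_lines):
    """Return (requests/s, input tokens/s, output tokens/s) implied by a trace.

    Assumes each record carries `timestamp` (ms), `input_length`, and
    `output_length`, and that the trace spans a non-zero duration.
    """
    records = [json.loads(line) for line in trace_lines]
    timestamps = [r["timestamp"] for r in records]
    duration_s = (max(timestamps) - min(timestamps)) / 1000.0
    if duration_s <= 0:
        raise ValueError("trace must span a non-zero duration")
    return (
        len(records) / duration_s,
        sum(r["input_length"] for r in records) / duration_s,
        sum(r["output_length"] for r in records) / duration_s,
    )


# Made-up records for illustration only.
toy_trace = [
    '{"timestamp": 0, "input_length": 40000, "output_length": 300}',
    '{"timestamp": 5000, "input_length": 20000, "output_length": 500}',
    '{"timestamp": 10000, "input_length": 60000, "output_length": 200}',
]
req_rate, in_rate, out_rate = offered_load(toy_trace)
print(f"{req_rate:.2f} req/s, {in_rate:.0f} in tok/s, {out_rate:.0f} out tok/s")
```

As long as a deployment keeps up with the schedule, its measured throughput approaches these figures regardless of the routing policy, so the comparison rests on latency.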

For production contexts, we can further evaluate the deployments with **goodput**, i.e. the rate of requests that satisfy a predetermined service level agreement (SLA). For our experiments, we set the SLA thresholds to TTFT = 20 s and ITL = 50 ms.
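
As a rough illustration of goodput, the sketch below counts the requests that meet both thresholds and divides by the benchmark duration. The per-request record layout (`ttft_s`, `itl_ms`) and the `goodput` helper are hypothetical, and treating the SLA values as upper bounds on each request's TTFT and mean ITL is an assumption. The benchmark tool's own goodput computation may differ.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    """Hypothetical per-request latency record."""
    ttft_s: float   # time to first token, in seconds
    itl_ms: float   # mean inter-token latency, in milliseconds


def goodput(requests, duration_s, max_ttft_s=20.0, max_itl_ms=50.0):
    """Requests per second that satisfy both SLA thresholds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1 for r in requests
        if r.ttft_s <= max_ttft_s and r.itl_ms <= max_itl_ms
    )
    return good / duration_s


# Made-up values for illustration only, not measured results.
sample = [
    RequestMetrics(ttft_s=3.2, itl_ms=21.0),   # meets both thresholds
    RequestMetrics(ttft_s=25.0, itl_ms=18.0),  # TTFT above threshold
    RequestMetrics(ttft_s=8.0, itl_ms=64.0),   # ITL above threshold
    RequestMetrics(ttft_s=19.5, itl_ms=49.0),  # meets both thresholds
]
print(f"goodput: {goodput(sample, duration_s=10.0):.2f} req/s")
```

Because the request schedule is fixed, a deployment with lower latency shows up as higher goodput even though raw throughput is the same.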

## Cleanup

```bash
# Delete benchmark jobs
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}

# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
```

## References

- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
- [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.html) - TensorRT-LLM tech blog on available optimizations for DeepSeek V3.2 on GB200