# DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP
This **GB200 NVL72** recipe for DeepSeek V3.2 demonstrates the performance difference between **aggregated (round-robin) routing** and **disaggregated (KV-aware) routing + WideEP** on a synthetic trace dataset adapted from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).
## Results
https://github.com/user-attachments/assets/fcdb703c-7c1a-4109-a7ca-54196fcef885
## Experiment Overview
We compare two deployment modes on **32x GB200 GPUs across 8 nodes**:
| Mode | Routing | Configuration |
|------|---------|---------------|
| **Aggregated** | Round-robin | 4x DEP8 workers |
| **Disaggregated** | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |
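As a quick sanity check, the two modes in the table can be compared on GPU count. This is a minimal sketch that assumes every DEP8 worker (aggregated, prefill, or decode) occupies 8 GPUs; the table does not state this explicitly, and the variable and function names are illustrative only.

```shell
# Assumption: each DEP8 worker spans 8 GPUs (not stated explicitly in the recipe).
gpus_per_worker=8
nodes=8

# total_gpus WORKERS -> number of GPUs used by that many workers
total_gpus() { echo $(( $1 * gpus_per_worker )); }

agg_gpus=$(total_gpus 4)               # 4x aggregated workers
disagg_gpus=$(total_gpus $(( 2 + 2 ))) # 2x prefill + 2x decode workers

echo "aggregated:    ${agg_gpus} GPUs"
echo "disaggregated: ${disagg_gpus} GPUs"
echo "GPUs per node: $(( agg_gpus / nodes ))"
```

Under that assumption both modes use the same 32 GPUs, so any latency difference comes from routing and worker layout rather than from extra hardware.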
## Dataset: Mooncake-based Synthetic Coding Trace
The benchmark uses a trace that simulates coding workloads. We synthesize it by increasing the input sequence length and prefix reuse rate of the original [Mooncake conversation trace](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/conversation_trace.jsonl).
To reproduce our benchmark, run Dynamo's [prefix data generator tool](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator) on the Mooncake `conversation_trace.jsonl`:
```bash
datagen synthesize \
--input-file conversation_trace.jsonl \
--prefix-len-multiplier 16 \
--prompt-len-multiplier 10 \
--max-isl 110000 \
--num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`
```
The ISL, OSL, and cache-hit statistics of our trace are shown below.
```
============================================================
DATASET ANALYSIS: Mooncake-based Synthetic Trace
============================================================
OVERVIEW
----------------------------------------
Total Requests: 10,000
Unique Hash Blocks: 430,838
Total Hash Blocks: 770,934
INPUT SEQUENCE LENGTH (ISL)
----------------------------------------
Average: 39,186 tokens
Maximum: 109,459 tokens
Minimum: 12,801 tokens
OUTPUT SEQUENCE LENGTH (OSL)
----------------------------------------
Average: 344 tokens
Maximum: 2,000 tokens
Minimum: 1 tokens
KV CACHE / PREFIX REUSE
----------------------------------------
Block-level Hit Rate: 44.1%
Token-level Hit Rate: 44.0%
Avg Context (shared): 22,400 tokens/req
Avg Unique Prompt: 16,786 tokens/req
Shared Prefix Ratio: 57.2%
============================================================
Summary:
• ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
• ~57% of input tokens come from shared context prefixes
• Long-context workload: avg 39K input tokens, up to 109K max
```
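The percentages in the report can be reproduced from the raw counts. The sketch below assumes that the block-level hit rate is the share of hash blocks that repeat an earlier block (total minus unique, divided by total) and that the shared prefix ratio is the average shared context divided by the average ISL. These definitions are inferred from the numbers, not taken from the analysis tool, and `pct` is a hypothetical helper.

```shell
# pct NUM DEN -> prints 100 * NUM / DEN rounded to one decimal place
pct() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'; }

# Counts copied from the dataset report above
unique_blocks=430838
total_blocks=770934
avg_shared=22400   # Avg Context (shared), tokens/req
avg_unique=16786   # Avg Unique Prompt, tokens/req

# Assumed definition: a block counts as a hit if its hash was already seen
block_hit_rate=$(pct $(( total_blocks - unique_blocks )) "$total_blocks")

# Assumed definition: shared tokens over all input tokens (shared + unique)
shared_ratio=$(pct "$avg_shared" $(( avg_shared + avg_unique )))

echo "block-level hit rate: ${block_hit_rate}%"
echo "shared prefix ratio:  ${shared_ratio}%"
```

The two averages also sum to 39,186 tokens, which matches the reported average ISL.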
## Prerequisites
1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **32x GB200 GPUs** across 8 nodes
3. **Hugging Face token** configured:
```bash
export NAMESPACE=your-namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token" \
-n ${NAMESPACE}
```
## Quick Start
### 1. Create Storage
> **Note:** Edit `model-cache/model-cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).
```bash
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
```
### 2. Configure K8s Benchmarking Environment
For multinode Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. Without it, internode GPU peer memory access would fail.
```bash
kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}
```
If you change any names in this file, apply the same changes to the deployment YAMLs under `extraPodSpec.resourceClaims` and `mainContainer.resources.claims`.
### 3. Setup Model and Data
We use NVIDIA's official NVFP4-quantized checkpoint ([Hugging Face](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4)). Copy it into the PVC storage:
```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
```
Similarly, copy the trace file for the benchmark into the PVC:
```bash
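# Replace <trace-file> with your trace
# (conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case)
# and <pod-name> with a pod that mounts the model-cache PVC
kubectl cp <trace-file> ${NAMESPACE}/<pod-name>:/model-cache/traces/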
```
### 4. Deploy & Benchmark
**Option A: Aggregated (Round-Robin Baseline)**
```bash
# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
```
**Option B: Disaggregated (KV-Aware Routing)**
```bash
# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
```
### 5. Monitor Benchmark Progress
The benchmark runs inside a tmux session for easy monitoring:
```bash
# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark
# Attach to the tmux session to see intermediate results
# (replace <benchmark-pod> with the pod name from the previous command)
kubectl exec -it <benchmark-pod> -n ${NAMESPACE} -- tmux a -t benchmark
# Detach from tmux: Ctrl+B, then D
```
### 6. View Results
Results are saved to the `perf-cache` PVC:
```bash
# Check artifact directory (replace <benchmark-pod> with the benchmark pod name)
kubectl exec -it <benchmark-pod> -n ${NAMESPACE} -- ls -la /perf-cache/artifacts/
# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod>:/perf-cache/artifacts ./benchmark-results
```
## Expected Results
Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**, so latency metrics are what we compare:
| Metric | Why It Matters |
|--------|----------------|
| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| **Total Request Latency** | Combined benefit of both optimizations |
For production contexts, the deployments can be further evaluated by **goodput**, i.e. the rate of requests that satisfy a predetermined service level agreement (SLA). For our experiments, we set the SLA thresholds to TTFT = 20 s and ITL = 50 ms.
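To make the goodput definition concrete, here is a minimal sketch that counts the requests meeting both SLA thresholds. It assumes the thresholds are inclusive upper bounds and that per-request metrics have been exported to a CSV with the columns `request_id,ttft_s,itl_ms`. That file layout and the `goodput_count` helper are hypothetical and are not the benchmark's actual artifact format.

```shell
# goodput_count FILE [TTFT_MAX_S] [ITL_MAX_MS]
# FILE: hypothetical CSV with a header row and columns request_id,ttft_s,itl_ms
# Prints "<requests meeting the SLA>/<total requests>".
goodput_count() {
  awk -F',' -v ttft_max="${2:-20}" -v itl_max="${3:-50}" '
    NR == 1 { next }   # skip the header row
    NF < 3  { next }   # skip blank or malformed rows
    { total++ }
    # A request counts toward goodput only if it meets BOTH thresholds
    ($2 + 0) <= ttft_max && ($3 + 0) <= itl_max { good++ }
    END { printf "%d/%d\n", good, total }
  ' "$1"
}

# Usage with a small made-up sample (values are illustrative, not measured)
sample=$(mktemp)
printf 'request_id,ttft_s,itl_ms\nr1,3.2,41\nr2,25.0,38\nr3,19.9,50\nr4,1.0,62\n' > "$sample"
goodput_count "$sample"
rm -f "$sample"
```

Dividing the count of SLA-compliant requests by the benchmark duration turns it into a rate. Because the fixed schedule pins the offered load, the compliant fraction alone is enough to compare the two deployments.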
## Cleanup
```bash
# Delete benchmark jobs
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}
# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
```
## References
- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
- [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.html) - TensorRT-LLM tech blog on available optimizations for DeepSeek V3.2 on GB200