# DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP

This **GB200 NVL72** recipe for DeepSeek V3.2 demonstrates the performance difference between **aggregated (round-robin) routing** and **disaggregated (KV-aware) routing + WideEP** on a synthetic trace dataset adapted from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).

## Results

https://github.com/user-attachments/assets/fcdb703c-7c1a-4109-a7ca-54196fcef885

## Experiment Overview

We compare two deployment modes on **32x GB200 GPUs across 8 nodes**:

| Mode | Routing | Configuration |
|------|---------|---------------|
| **Aggregated** | Round-robin | 4x DEP8 workers |
| **Disaggregated** | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |

## Dataset: Mooncake-based Synthetic Coding Trace

The benchmark uses a trace that simulates coding workloads. We synthesize the trace by increasing the input sequence length and prefix reuse rate of the original [Mooncake conversation trace](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/traces/conversation_trace.jsonl).

To reproduce our benchmark, run Dynamo's [prefix data generator tool](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator) on the Mooncake `conversation_trace.jsonl`:

```bash
datagen synthesize \
  --input-file conversation_trace.jsonl \
  --prefix-len-multiplier 16 \
  --prompt-len-multiplier 10 \
  --max-isl 110000 \
  --num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`
```

The ISL/OSL/cache hit statistics of our trace are below.

Dataset statistics: Mooncake-based Synthetic Trace

```
============================================================
DATASET ANALYSIS: Mooncake-based Synthetic Trace
============================================================

OVERVIEW
----------------------------------------
  Total Requests:        10,000
  Unique Hash Blocks:    430,838
  Total Hash Blocks:     770,934

INPUT SEQUENCE LENGTH (ISL)
----------------------------------------
  Average:               39,186 tokens
  Maximum:               109,459 tokens
  Minimum:               12,801 tokens

OUTPUT SEQUENCE LENGTH (OSL)
----------------------------------------
  Average:               344 tokens
  Maximum:               2,000 tokens
  Minimum:               1 tokens

KV CACHE / PREFIX REUSE
----------------------------------------
  Block-level Hit Rate:  44.1%
  Token-level Hit Rate:  44.0%
  Avg Context (shared):  22,400 tokens/req
  Avg Unique Prompt:     16,786 tokens/req
  Shared Prefix Ratio:   57.2%

============================================================
Summary:
  • ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
  • ~57% of input tokens come from shared context prefixes
  • Long-context workload: avg 39K input tokens, up to 109K max
```

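As a rough illustration of what "hit rate based on hash_id overlap" can mean, the sketch below counts a block as a hit when its hash ID already appeared in an earlier request. It assumes each JSONL record carries a `hash_ids` list (as in the Mooncake trace format) and an unbounded cache with no eviction. The function names are hypothetical, and the script that produced the report above may compute its figures differently. Under these assumptions the hit rate reduces to `1 - unique_blocks / total_blocks`, which agrees with the reported 44.1%.

```python
import json
from typing import Dict, Iterable, List


def block_hit_rate(requests: Iterable[Dict]) -> float:
    """Fraction of hash blocks that already appeared in an earlier request.

    Assumes an unbounded cache (nothing is ever evicted) and a `hash_ids`
    list on every record.
    """
    seen = set()
    total = 0
    hits = 0
    for req in requests:
        ids = req["hash_ids"]
        total += len(ids)
        # A block is a hit only if a previous request introduced it.
        hits += sum(1 for h in ids if h in seen)
        seen.update(ids)
    return hits / total if total else 0.0


def load_trace(path: str) -> List[Dict]:
    """Read a JSONL trace file into a list of records (hypothetical helper)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy trace: 10 blocks in total, 5 of them unique.
toy = [
    {"hash_ids": [0, 1, 2]},
    {"hash_ids": [0, 1, 3]},
    {"hash_ids": [0, 1, 2, 4]},
]
toy_rate = block_hit_rate(toy)

# Same identity applied to the aggregate counts in the report above.
reported_rate = 1 - 430_838 / 770_934
print(f"toy: {toy_rate:.1%}, reported counts: {reported_rate:.1%}")
```

To run it on the synthesized trace, pass `load_trace(...)` the path of the generated JSONL file. A real deployment has finite cache capacity and per-worker caches, so observed hit rates depend on routing and eviction and will generally be lower than this upper bound.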
## Prerequisites

1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **32x GB200 GPUs** across 8 nodes
3. **HuggingFace token** configured:
   ```bash
   export NAMESPACE=your-namespace
   kubectl create secret generic hf-token-secret \
     --from-literal=HF_TOKEN="your-token" \
     -n ${NAMESPACE}
   ```

## Quick Start

### 1. Create Storage

> **Note:** Edit `model-cache/model-cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).

```bash
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
```

### 2. Configure Kubernetes Benchmarking Environment

For multinode Kubernetes deployments, your cluster may require a ComputeDomain to exist in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. (Otherwise, internode GPU peer memory access would fail.)

```bash
kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}
```

If you change any names in this file, apply the same changes to the deployment YAMLs under `extraPodSpec.resourceClaims` and `mainContainer.resources.claims`.

### 3. Setup Model and Data

We use NVIDIA's official NVFP4-quantized checkpoint ([Hugging Face](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4)). Copy it into the PVC storage:

```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
```

Similarly, copy the trace file for the benchmark into the PVC:

```bash
# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
kubectl cp your-namespace/:/model-cache/traces/
```

### 4. Deploy & Benchmark

**Option A: Aggregated (Round-Robin Baseline)**

```bash
# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
```

**Option B: Disaggregated (KV-Aware Routing)**

```bash
# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
```

### 5. Monitor Benchmark Progress

The benchmark runs inside a tmux session for easy monitoring:

```bash
# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark

# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} -- tmux a -t benchmark

# Detach from tmux: Ctrl+B, then D
```

### 6. View Results

Results are saved to the `perf-cache` PVC:

```bash
# Check artifact directory
kubectl exec -it -n ${NAMESPACE} -- ls -la /perf-cache/artifacts/

# Copy results to local machine
kubectl cp ${NAMESPACE}/:/perf-cache/artifacts ./benchmark-results
```

## Expected Results

Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**; latency metrics are what we compare:

| Metric | Why It Matters |
|--------|----------------|
| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| **Total Request Latency** | Combined benefit of both optimizations |

For production contexts, we can further evaluate the deployments with **goodput**, i.e.
the rate of requests which satisfy a predetermined service level agreement (SLA). For our experiments, we set the SLA as TTFT=20s and ITL=50ms.

## Cleanup

```bash
# Delete benchmark pods
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}

# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
```

## References

- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
- [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.html) - TRTLLM tech blog on available optimizations for DSV3.2 on GB200