docs: Add Agg Round Robin vs Disagg KV Router Recipe (#5021)

Signed-off-by: alec-flowers <aflowers@nvidia.com>

Commit b823575e (unverified), authored Dec 18, 2025 by Alec, committed by GitHub Dec 18, 2025
7 changed files
--- a/recipes/qwen3-32b/README.md
+++ b/recipes/qwen3-32b/README.md
+# Qwen3-32B: Aggregated Round Robin vs Disaggregated KV Routing Comparison
+
+This recipe demonstrates the performance difference between **aggregated (round-robin)** and **disaggregated (KV-aware)** routing using a real-world conversation trace dataset from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).
+
+## Experiment Overview
+
+We compare two deployment modes on **16x H200 GPUs across 2 nodes**:
+
+| Mode | Routing | Configuration |
+|------|---------|---------------|
+| **Aggregated** | Round-robin | 8x TP2 workers |
+| **Disaggregated** | KV-aware | 6x prefill + 2x decode (TP2) |
+
+## Dataset: Mooncake Conversation Trace
+
+The benchmark uses a production conversation trace with significant prefix sharing potential:
+
+| Metric | Value |
+|--------|-------|
+| Requests | 12,031 over ~59 minutes (3.4 req/s) |
+| Input tokens/sec | 40,937 tok/s |
+| Input length | avg 12,035 tokens (range: 891 - 126,195) |
+| Output length | avg 343 tokens |
+
+**Cache Reuse Analysis:**
+
+| Metric | Value | What It Measures |
+|--------|-------|------------------|
+| Blocks reused | 24.2% | Of 182,790 unique blocks, 44,144 appeared in more than one request |
+| Cache efficiency | 36.64% | Of 288,500 total block references, 105,710 were repeats (reusable with infinite cache) |
+
+*Why these differ:* Block reuse counts unique blocks that repeat, ignoring how often they repeat. Cache efficiency weights by frequency—a block reused 12,031 times contributes more than one reused once.
+
+This workload is well suited to KV-aware routing: with 36.64% cache efficiency, more than a third of block references are repeats, so routing requests to workers that already hold the relevant KV blocks avoids recomputing them during prefill and lowers TTFT.
+
+## Prerequisites
+
+1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+2. **16x H200 GPUs** across 2 nodes
+3. **HuggingFace token** configured:
+   ```bash
+   export NAMESPACE=your-namespace
+   kubectl create secret generic hf-token-secret \
+     --from-literal=HF_TOKEN="your-token" \
+     -n ${NAMESPACE}
+   ```
+
+## Quick Start
+
+### 1. Create Storage
+
+> **Note:** Edit `model-cache/cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).
+
+```bash
+kubectl apply -f model-cache/cache.yaml -n ${NAMESPACE}
+```
+
+### 2. Download Model
+
+```bash
+kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
+```
+
+### 3. Deploy & Benchmark
+
+**Option A: Aggregated (Round-Robin Baseline)**
+
+```bash
+# Deploy
+kubectl apply -f vllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}
+
+# Wait for ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-8xtp2 \
+  -n ${NAMESPACE} --timeout=1200s
+
+# Run benchmark
+kubectl apply -f vllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
+```
+
+**Option B: Disaggregated (KV-Aware Routing)**
+
+```bash
+# Deploy
+kubectl apply -f vllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
+
+# Wait for ready
+kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-router-6p-2d \
+  -n ${NAMESPACE} --timeout=1200s
+
+# Run benchmark
+kubectl apply -f vllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
+```
+
+### 4. Monitor Benchmark Progress
+
+The benchmark runs inside a tmux session for easy monitoring:
+
+```bash
+# Find the benchmark pod
+kubectl get pods -n ${NAMESPACE} | grep benchmark
+
+# Attach to the tmux session to see intermediate results
+kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark
+
+# Detach from tmux: Ctrl+B, then D
+```
+
+### 5. View Results
+
+Results are saved to the `perf-cache` PVC:
+
+```bash
+# Check artifact directory
+kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/
+
+# Copy results to local machine
+kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results
+```
+
+## Expected Results
+
+Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**—latency metrics are what we're comparing:
+
+| Metric | Why It Matters |
+|--------|----------------|
+| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
+| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
+| **Total Request Latency** | Combined benefit of both optimizations |
+
+**Why disaggregated + KV-aware routing helps this workload:**
+
+1. **KV-aware routing** uses the trace's 36.64% cache efficiency: requests are routed to workers that already hold the relevant KV cache blocks, which reduces redundant prefill computation and lowers TTFT.
+
+2. **Disaggregated serving** separates prefill and decode workers. With long input sequences (avg 12K tokens) and short outputs (avg 343 tokens), dedicated decode workers avoid "prefill injection"—where a new long-context request interrupts ongoing decode operations, causing ITL spikes.
+
+## Cleanup
+
+```bash
+# Delete benchmark pods
+kubectl delete pod -l app=benchmark -n ${NAMESPACE}
+
+# Delete deployments
+kubectl delete dynamographdeployment agg-8xtp2 -n ${NAMESPACE}
+kubectl delete dynamographdeployment disagg-router-6p-2d -n ${NAMESPACE}
+```
+
+## References
+
+- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
+
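Review note on the README's Cache Reuse Analysis: the two percentages can be reproduced from per-request block IDs. The following is a minimal Python sketch, not the script used to produce the README's numbers. It assumes each trace record exposes its blocks as a list of integer hash IDs; the function name `cache_reuse_stats` and the toy trace are hypothetical.

```python
from collections import Counter


def cache_reuse_stats(requests):
    """Compute reuse metrics from a list of per-request block-ID lists."""
    # For every block, count how many requests reference it.
    refs = Counter()
    for hash_ids in requests:
        refs.update(set(hash_ids))  # a block counts once per request

    unique_blocks = len(refs)
    total_refs = sum(refs.values())
    # "Blocks reused": unique blocks seen in more than one request.
    reused_blocks = sum(1 for n in refs.values() if n > 1)
    # "Cache efficiency": references beyond the first one to each block,
    # i.e. hits with an infinite cache.
    repeat_refs = total_refs - unique_blocks
    return {
        "blocks_reused": reused_blocks / unique_blocks,
        "cache_efficiency": repeat_refs / total_refs,
    }


# Toy trace: block 0 is shared by all four requests, block 1 by two.
toy_trace = [[0, 1, 2], [0, 1, 3], [0, 4], [0, 5]]
stats = cache_reuse_stats(toy_trace)
print(stats)

# The README's own figures follow from the counts it states.
readme_blocks_reused = round(100 * 44_144 / 182_790, 1)
readme_cache_efficiency = round(100 * 105_710 / 288_500, 2)
print(readme_blocks_reused, readme_cache_efficiency)
```

On the toy trace, two of six unique blocks repeat (about 33% blocks reused) while four of ten references are repeats (40% cache efficiency). This is the same effect that separates the README's 24.2% from its 36.64%: a heavily shared block counts once in the first metric and once per repeat in the second.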
--- a/recipes/qwen3-32b/model-cache/cache.yaml
+++ b/recipes/qwen3-32b/model-cache/cache.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 100Gi
+  storageClassName: "your-storage-class-name"
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: compilation-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 50Gi
+  storageClassName: "your-storage-class-name"
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: perf-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 50Gi
+  storageClassName: "your-storage-class-name"
--- a/recipes/qwen3-32b/model-cache/model-download.yaml
+++ b/recipes/qwen3-32b/model-cache/model-download.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  backoffLimit: 3
+  completions: 1
+  parallelism: 1
+  template:
+    metadata:
+      labels:
+        app: model-download
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: model-download
+          image: python:3.10-slim
+          command: ["sh", "-c"]
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          env:
+            - name: MODEL_NAME
+              value: "Qwen/Qwen3-32B"
+            - name: HF_HOME
+              value: /home/dynamo/.cache/huggingface
+            - name: HF_HUB_ENABLE_HF_TRANSFER
+              value: "1"
+            - name: MODEL_REVISION
+              value: 9216db5781bf21249d130ec9da846c4624c16137
+          args:
+            - |
+              set -eux
+              pip install --no-cache-dir huggingface_hub hf_transfer
+              hf download $MODEL_NAME --revision $MODEL_REVISION
+          volumeMounts:
+            - name: model-cache
+              mountPath: /home/dynamo/.cache/huggingface
+      volumes:
+      - name: model-cache
+        persistentVolumeClaim:
+          claimName: model-cache
--- a/recipes/qwen3-32b/vllm/agg-round-robin/deploy.yaml
+++ b/recipes/qwen3-32b/vllm/agg-round-robin/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: agg-8xtp2
+spec:
+  pvcs:
+  - create: false
+    name: model-cache
+  - create: false
+    name: compilation-cache
+  services:
+    Frontend:
+      componentType: frontend
+      dynamoNamespace: agg-8xtp2
+      envs:
+        - name: HF_HOME
+          value: /home/dynamo/.cache/huggingface
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+          args:
+            - --router-reset-states
+      replicas: 1
+      resources:
+        requests:
+          cpu: "8"
+        limits:
+          cpu: "8"
+      subComponentType: null
+    VllmDecodeWorker:
+      componentType: worker
+      dynamoNamespace: agg-8xtp2
+      envFromSecret: hf-token-secret
+      volumeMounts:
+      - name: model-cache
+        mountPoint: /home/dynamo/.cache/huggingface
+      - name: compilation-cache
+        mountPoint: /home/dynamo/.cache/vllm
+        useAsCompilationCache: true
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-32B
+          - --tensor-parallel-size
+          - '2'
+          - --disable-log-requests
+          - --gpu-memory-utilization
+          - '0.90'
+          - --async-scheduling
+          - --block-size
+          - '64'
+          - --hf-overrides
+          - '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
+          - --max-model-len
+          - '131072'
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          env:
+          - name: DYN_HEALTH_CHECK_ENABLED
+            value: "false"
+          - name: HF_HOME
+            value: /home/dynamo/.cache/huggingface
+          workingDir: /workspace
+      replicas: 8
+      resources:
+        limits:
+          gpu: '2'
+          custom:
+            rdma/ib: "2"
+        requests:
+          gpu: '2'
+      subComponentType: decode
--- a/recipes/qwen3-32b/vllm/agg-round-robin/perf.yaml
+++ b/recipes/qwen3-32b/vllm/agg-round-robin/perf.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: Pod
+metadata:
+  name: agg-8xtp2-benchmark
+  labels:
+    app: benchmark
+spec:
+  containers:
+  - name: python
+    image: python:3.11
+    command:
+      - /bin/bash
+      - -lc
+      - |
+        # Setup
+        ulimit -n 1048576
+        ulimit -u 65536
+        apt update && apt install tmux wget curl jq -y
+
+        # Install benchmarking tool
+        pip install aiperf
+
+        # Wait for model to be ready
+        echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
+        until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+          echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
+          sleep 5
+        done
+        echo "Model '${MODEL_NAME}' is ready!"
+
+        # Download Mooncake conversation trace dataset if not already present
+        mkdir -p ${BASE_DIR}/traces
+        mkdir -p ${BASE_DIR}/artifacts
+        if [ ! -f ${BASE_DIR}/traces/conversation_trace.jsonl ]; then
+          wget -qO ${BASE_DIR}/traces/conversation_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/FAST25-release/traces/conversation_trace.jsonl
+        fi
+
+        # Setup Paths and Endpoints
+        export INPUT_FILE="${BASE_DIR}/traces/conversation_trace.jsonl"
+        export MODEL_BASE_NAME="${MODEL_NAME##*/}"
+        export FRONTEND_LIB="${FRONTEND%-frontend}"
+        export ARTIFACT_DIR="${BASE_DIR}/artifacts/${MODEL_BASE_NAME}_${FRONTEND_LIB}"
+        mkdir -p "${ARTIFACT_DIR}"
+
+        # Run the benchmark in tmux so it's easy to attach and watch
+        tmux new-session -d -s benchmark -c "${ARTIFACT_DIR}"
+        tmux send-keys -t benchmark "aiperf profile -m ${MODEL_NAME} --input-file ${INPUT_FILE} --custom-dataset-type mooncake_trace --fixed-schedule --url http://${FRONTEND}:8000 --streaming --artifact-dir ${ARTIFACT_DIR} --goodput \"time_to_first_token:2000 inter_token_latency:25\"" C-m
+        sleep 7200
+    env:
+      - name: MODEL_NAME
+        value: Qwen/Qwen3-32B
+      - name: FRONTEND
+        value: agg-8xtp2-frontend
+      - name: BASE_DIR
+        value: /perf-cache
+    resources:
+      requests:
+        cpu: "8"
+        memory: 16Gi
+      limits:
+        cpu: "16"
+        memory: 32Gi
+    volumeMounts:
+    - name: model-cache
+      mountPath: /home/dynamo/.cache/huggingface
+    - name: perf-cache
+      mountPath: /perf-cache
+    workingDir: /workspace
+  volumes:
+  - name: model-cache
+    persistentVolumeClaim:
+      claimName: model-cache
+  - name: perf-cache
+    persistentVolumeClaim:
+      claimName: perf-cache
+  restartPolicy: Never
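Review note on the artifact path in this perf.yaml: the shell expansions `${MODEL_NAME##*/}` and `${FRONTEND%-frontend}` decide which directory under `/perf-cache/artifacts` holds the results. The Python sketch below mirrors those two expansions to show the resulting path for the values set in `env`; the helper name `artifact_dir` is hypothetical and exists only for illustration.

```python
def artifact_dir(base_dir: str, model_name: str, frontend: str) -> str:
    """Mirror the path logic of the benchmark pod's shell script."""
    # ${MODEL_NAME##*/}: drop everything up to and including the last "/".
    model_base_name = model_name.rsplit("/", 1)[-1]
    # ${FRONTEND%-frontend}: drop a trailing "-frontend" if present.
    frontend_lib = frontend.removesuffix("-frontend")  # Python 3.9+
    return f"{base_dir}/artifacts/{model_base_name}_{frontend_lib}"


agg_path = artifact_dir("/perf-cache", "Qwen/Qwen3-32B", "agg-8xtp2-frontend")
print(agg_path)  # /perf-cache/artifacts/Qwen3-32B_agg-8xtp2
```

Because the frontend service name is part of the path, the aggregated and disaggregated runs write to separate directories on the shared `perf-cache` PVC.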
--- a/recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml
+++ b/recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: disagg-router-6p-2d
+spec:
+  pvcs:
+  - create: false
+    name: model-cache
+  - create: false
+    name: compilation-cache
+  services:
+    Frontend:
+      componentType: frontend
+      dynamoNamespace: disagg-router-6p-2d
+      envs:
+        - name: HF_HOME
+          value: /home/dynamo/.cache/huggingface
+      extraPodSpec:
+        mainContainer:
+          args:
+            - --router-mode
+            - kv
+            - --router-reset-states
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        requests:
+          cpu: "8"
+        limits:
+          cpu: "8"
+      subComponentType: null
+    VllmDecodeWorker:
+      componentType: worker
+      dynamoNamespace: disagg-router-6p-2d
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-32B
+          - --tensor-parallel-size
+          - '2'
+          - --disable-log-requests
+          - --gpu-memory-utilization
+          - '0.90'
+          - --no-enable-prefix-caching
+          - --async-scheduling
+          - --block-size
+          - '64'
+          - --hf-overrides
+          - '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
+          - --max-model-len
+          - '131072'
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          workingDir: /workspace
+          env:
+          - name: DYN_HEALTH_CHECK_ENABLED
+            value: "false"
+          - name: HF_HOME
+            value: /home/dynamo/.cache/huggingface
+      replicas: 2
+      resources:
+        limits:
+          gpu: '2'
+          custom:
+            rdma/ib: "2"
+        requests:
+          gpu: '2'
+      subComponentType: decode
+      volumeMounts:
+      - name: model-cache
+        mountPoint: /home/dynamo/.cache/huggingface
+      - name: compilation-cache
+        mountPoint: /home/dynamo/.cache/vllm
+        useAsCompilationCache: true
+    VllmPrefillWorker:
+      componentType: worker
+      dynamoNamespace: disagg-router-6p-2d
+      envFromSecret: hf-token-secret
+      extraPodMetadata:
+        annotations:
+          prometheus.io/scrape: "true"
+          prometheus.io/port: "9400"
+          prometheus.io/path: "/metrics"
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-32B
+          - --is-prefill-worker
+          - --tensor-parallel-size
+          - '2'
+          - --disable-log-requests
+          - --gpu-memory-utilization
+          - '0.90'
+          - --async-scheduling
+          - --block-size
+          - '64'
+          - --hf-overrides
+          - '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
+          - --max-model-len
+          - '131072'
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          env:
+          - name: DYN_HEALTH_CHECK_ENABLED
+            value: "false"
+          - name: HF_HOME
+            value: /home/dynamo/.cache/huggingface
+          workingDir: /workspace
+      replicas: 6
+      resources:
+        limits:
+          gpu: '2'
+          custom:
+            rdma/ib: "2"
+        requests:
+          gpu: '2'
+      subComponentType: prefill
+      volumeMounts:
+      - name: model-cache
+        mountPoint: /home/dynamo/.cache/huggingface
+      - name: compilation-cache
+        mountPoint: /home/dynamo/.cache/vllm
+        useAsCompilationCache: true
--- a/recipes/qwen3-32b/vllm/disagg-kv-router/perf.yaml
+++ b/recipes/qwen3-32b/vllm/disagg-kv-router/perf.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: v1
+kind: Pod
+metadata:
+  name: disagg-router-6p-2d-benchmark
+  labels:
+    app: benchmark
+spec:
+  containers:
+  - name: python
+    image: python:3.11
+    command:
+      - /bin/bash
+      - -lc
+      - |
+        # Setup
+        ulimit -n 1048576
+        ulimit -u 65536
+        apt update && apt install tmux wget curl jq -y
+
+        # Install benchmarking tool
+        pip install aiperf
+
+        # Wait for model to be ready
+        echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
+        until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
+          echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
+          sleep 5
+        done
+        echo "Model '${MODEL_NAME}' is ready!"
+
+        # Download Mooncake conversation trace dataset if not already present
+        mkdir -p ${BASE_DIR}/traces
+        mkdir -p ${BASE_DIR}/artifacts
+        if [ ! -f ${BASE_DIR}/traces/conversation_trace.jsonl ]; then
+          wget -qO ${BASE_DIR}/traces/conversation_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/FAST25-release/traces/conversation_trace.jsonl
+        fi
+
+        # Setup Paths and Endpoints
+        export INPUT_FILE="${BASE_DIR}/traces/conversation_trace.jsonl"
+        export MODEL_BASE_NAME="${MODEL_NAME##*/}"
+        export FRONTEND_LIB="${FRONTEND%-frontend}"
+        export ARTIFACT_DIR="${BASE_DIR}/artifacts/${MODEL_BASE_NAME}_${FRONTEND_LIB}"
+        mkdir -p "${ARTIFACT_DIR}"
+
+        # Run the benchmark in tmux so it's easy to attach and watch
+        tmux new-session -d -s benchmark -c "${ARTIFACT_DIR}"
+        tmux send-keys -t benchmark "aiperf profile -m ${MODEL_NAME} --input-file ${INPUT_FILE} --custom-dataset-type mooncake_trace --fixed-schedule --url http://${FRONTEND}:8000 --streaming --artifact-dir ${ARTIFACT_DIR} --goodput \"time_to_first_token:2000 inter_token_latency:25\"" C-m
+        sleep 7200
+    env:
+      - name: MODEL_NAME
+        value: Qwen/Qwen3-32B
+      - name: FRONTEND
+        value: disagg-router-6p-2d-frontend
+      - name: BASE_DIR
+        value: /perf-cache
+    resources:
+      requests:
+        cpu: "8"
+        memory: 16Gi
+      limits:
+        cpu: "16"
+        memory: 32Gi
+    volumeMounts:
+    - name: model-cache
+      mountPath: /home/dynamo/.cache/huggingface
+    - name: perf-cache
+      mountPath: /perf-cache
+    workingDir: /workspace
+  volumes:
+  - name: model-cache
+    persistentVolumeClaim:
+      claimName: model-cache
+  - name: perf-cache
+    persistentVolumeClaim:
+      claimName: perf-cache
+  restartPolicy: Never
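Review note on the `--goodput "time_to_first_token:2000 inter_token_latency:25"` argument passed in both perf.yaml files: a rough way to read it is that a request counts as "good" only when it meets every listed threshold. The Python sketch below illustrates that reading on made-up per-request numbers. It is not aiperf's implementation; the millisecond units, the `<=` comparison, and the helper names `parse_slos` and `good_fraction` are assumptions for illustration.

```python
def parse_slos(spec: str) -> dict:
    """Parse 'metric:limit metric:limit' into {metric: limit}."""
    slos = {}
    for item in spec.split():
        name, limit = item.split(":")
        slos[name] = float(limit)
    return slos


def good_fraction(requests: list, slos: dict) -> float:
    """Fraction of requests that satisfy every threshold in slos."""
    good = sum(
        1
        for r in requests
        if all(r[metric] <= limit for metric, limit in slos.items())
    )
    return good / len(requests)


slos = parse_slos("time_to_first_token:2000 inter_token_latency:25")

# Made-up per-request latencies (assumed to be in milliseconds).
sample = [
    {"time_to_first_token": 850.0, "inter_token_latency": 18.0},   # meets both
    {"time_to_first_token": 2400.0, "inter_token_latency": 12.0},  # TTFT too high
    {"time_to_first_token": 1200.0, "inter_token_latency": 31.0},  # ITL too high
    {"time_to_first_token": 1999.0, "inter_token_latency": 25.0},  # meets both
]
fraction = good_fraction(sample, slos)
print(fraction)
```

Under this reading, a run can miss the goodput target through either metric, which is why the README compares TTFT and ITL separately between the two deployments.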