Commit b823575e authored by Alec, committed by GitHub

docs: Add Agg Round Robin vs Disagg KV Router Recipe (#5021)


Signed-off-by: alec-flowers <aflowers@nvidia.com>
parent 18b64e90
# Qwen3-32B: Aggregated Round Robin vs Disaggregated KV Routing Comparison
This recipe demonstrates the performance difference between **aggregated (round-robin)** and **disaggregated (KV-aware)** routing using a real-world conversation trace dataset from the [Mooncake FAST25 paper](https://github.com/kvcache-ai/Mooncake).
## Experiment Overview
We compare two deployment modes on **16x H200 GPUs across 2 nodes**:
| Mode | Routing | Configuration |
|------|---------|---------------|
| **Aggregated** | Round-robin | 8x TP2 workers |
| **Disaggregated** | KV-aware | 6x prefill + 2x decode (TP2) |
## Dataset: Mooncake Conversation Trace
The benchmark uses a production conversation trace with significant prefix sharing potential:
| Metric | Value |
|--------|-------|
| Requests | 12,031 over ~59 minutes (3.4 req/s) |
| Input tokens/sec | 40,937 tok/s |
| Input length | avg 12,035 tokens (range: 891 - 126,195) |
| Output length | avg 343 tokens |
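The figures in this table can be recomputed from the trace file itself. The sketch below assumes each line of `conversation_trace.jsonl` is a JSON object with `timestamp` (in milliseconds), `input_length`, and `output_length` fields, as in the published Mooncake traces. `summarize_trace` is an illustrative helper, not part of aiperf or Dynamo.

```python
import json
from statistics import mean


def summarize_trace(lines):
    """Summarize Mooncake-style trace records given as an iterable of JSON lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    timestamps = [r["timestamp"] for r in records]
    # Assumption: timestamps are milliseconds from the start of the trace.
    duration_s = (max(timestamps) - min(timestamps)) / 1000.0
    input_lengths = [r["input_length"] for r in records]
    return {
        "requests": len(records),
        "duration_s": duration_s,
        "req_per_s": len(records) / duration_s,
        "input_tok_per_s": sum(input_lengths) / duration_s,
        "avg_input_len": mean(input_lengths),
        "min_input_len": min(input_lengths),
        "max_input_len": max(input_lengths),
        "avg_output_len": mean(r["output_length"] for r in records),
    }


# Two made-up records, only to show the shape of the input.
sample = [
    '{"timestamp": 0, "input_length": 1000, "output_length": 100, "hash_ids": [0, 1]}',
    '{"timestamp": 2000, "input_length": 3000, "output_length": 300, "hash_ids": [0, 2]}',
]
stats = summarize_trace(sample)
print(stats)
```

To run it on the real trace, pass an open file handle instead of `sample`, for example `summarize_trace(open("conversation_trace.jsonl"))`.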
**Cache Reuse Analysis:**
| Metric | Value | What It Measures |
|--------|-------|------------------|
| Blocks reused | 24.2% | Of 182,790 unique blocks, 44,144 appeared in more than one request |
| Cache efficiency | 36.64% | Of 288,500 total block references, 105,710 were repeats (reusable with infinite cache) |
*Why these differ:* Block reuse counts unique blocks that repeat, ignoring how often they repeat. Cache efficiency weights by frequency—a block reused 12,031 times contributes more than one reused once.
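A small sketch of how the two metrics could be derived from the per-request `hash_ids` lists in the trace. It assumes an infinite cache and that each hash ID identifies one KV block; `cache_reuse_metrics` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def cache_reuse_metrics(requests):
    """Compute block reuse and cache efficiency from per-request block-hash lists."""
    requests_per_block = Counter()  # block hash -> number of requests that contain it
    total_refs = 0                  # total block references across all requests
    for hash_ids in requests:
        blocks = set(hash_ids)      # count each block once per request
        total_refs += len(blocks)
        requests_per_block.update(blocks)

    unique_blocks = len(requests_per_block)
    reused_blocks = sum(1 for n in requests_per_block.values() if n > 1)
    # Every reference beyond the first one to a block is a potential cache hit.
    repeat_refs = total_refs - unique_blocks
    return {
        "blocks_reused": reused_blocks / unique_blocks,
        "cache_efficiency": repeat_refs / total_refs,
    }


# Toy trace: block 1 appears in three requests, block 2 in two, the rest once.
toy = [[1, 2, 3], [1, 2, 4], [1, 5]]
metrics = cache_reuse_metrics(toy)
print(metrics)  # {'blocks_reused': 0.4, 'cache_efficiency': 0.375}
```

In the toy trace, 2 of 5 unique blocks repeat (40%), while 3 of 8 references are repeats (37.5%). The same arithmetic on the table's counts gives 44,144 / 182,790 ≈ 24.2% and 105,710 / 288,500 ≈ 36.64%.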
This workload suits KV-aware routing: with 36.64% cache efficiency, the router can send requests to workers that already hold the relevant KV blocks, which avoids redundant prefill work and should lower TTFT.
## Prerequisites
1. **Dynamo Platform installed** - See [Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
2. **16x H200 GPUs** across 2 nodes
3. **HuggingFace token** configured:
```bash
export NAMESPACE=your-namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token" \
-n ${NAMESPACE}
```
## Quick Start
### 1. Create Storage
> **Note:** Edit `model-cache/cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).
```bash
kubectl apply -f model-cache/cache.yaml -n ${NAMESPACE}
```
### 2. Download Model
```bash
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
```
### 3. Deploy & Benchmark
**Option A: Aggregated (Round-Robin Baseline)**
```bash
# Deploy
kubectl apply -f vllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-8xtp2 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f vllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
```
**Option B: Disaggregated (KV-Aware Routing)**
```bash
# Deploy
kubectl apply -f vllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-router-6p-2d \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f vllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
```
### 4. Monitor Benchmark Progress
The benchmark runs inside a tmux session for easy monitoring:
```bash
# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark
# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark
# Detach from tmux: Ctrl+B, then D
```
### 5. View Results
Results are saved to the `perf-cache` PVC:
```bash
# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/
# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results
```
## Expected Results
Since the benchmark uses `--fixed-schedule` (replaying requests at their original timestamps), **throughput metrics are fixed by the trace**—latency metrics are what we're comparing:
| Metric | Why It Matters |
|--------|----------------|
| **TTFT** (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| **ITL** (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| **Total Request Latency** | Combined benefit of both optimizations |
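Both benchmark pods also pass `--goodput "time_to_first_token:2000 inter_token_latency:25"` to aiperf. The sketch below shows one way to read that setting. It assumes the thresholds are in milliseconds and that a request counts as good only when it meets every threshold. The record fields `ttft_ms` and `itl_ms` are hypothetical names chosen for this example, not aiperf's export schema.

```python
def goodput(records, duration_s, ttft_ms_max=2000.0, itl_ms_max=25.0):
    """Count requests that meet both latency targets and express them as a rate."""
    good = sum(
        1
        for r in records
        if r["ttft_ms"] <= ttft_ms_max and r["itl_ms"] <= itl_ms_max
    )
    return {
        "good_requests": good,
        "good_fraction": good / len(records),
        "good_req_per_s": good / duration_s,
    }


# Made-up per-request latencies for illustration only.
records = [
    {"ttft_ms": 850.0, "itl_ms": 18.0},   # meets both targets
    {"ttft_ms": 2400.0, "itl_ms": 17.0},  # TTFT too high
    {"ttft_ms": 900.0, "itl_ms": 31.0},   # ITL too high
    {"ttft_ms": 1200.0, "itl_ms": 25.0},  # meets both targets (boundary counts)
]
result = goodput(records, duration_s=2.0)
print(result)
```

Because the request schedule is fixed by the trace, a difference in goodput between the two deployments reflects latency behavior rather than offered load.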
**Why disaggregated + KV-aware routing helps this workload:**
1. **KV-aware routing** leverages the 36.64% cache efficiency to route requests to workers that already have relevant KV cache blocks, reducing redundant prefill computation and lowering TTFT.
2. **Disaggregated serving** separates prefill and decode workers. With long input sequences (avg 12K tokens) and short outputs (avg 343 tokens), dedicated decode workers avoid "prefill injection"—where a new long-context request interrupts ongoing decode operations, causing ITL spikes.
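The routing effect in point 1 can be illustrated with a toy simulation. This is a deliberately simplified model, not Dynamo's router: it assumes unlimited cache per worker, ignores worker load entirely, and breaks ties by picking the lowest worker index. All function names are made up for this example.

```python
from itertools import cycle


def prefix_overlap(cached, blocks):
    """Number of leading blocks of a request that a worker already has cached."""
    n = 0
    for block in blocks:
        if block not in cached:
            break
        n += 1
    return n


def simulate(requests, num_workers, kv_aware):
    """Return the total number of prefix-cache block hits under one routing policy."""
    caches = [set() for _ in range(num_workers)]
    round_robin = cycle(range(num_workers))
    hits = 0
    for blocks in requests:
        if kv_aware:
            # Pick the worker with the longest cached prefix (ties: lowest index).
            worker = max(
                range(num_workers),
                key=lambda i: prefix_overlap(caches[i], blocks),
            )
        else:
            worker = next(round_robin)
        hits += prefix_overlap(caches[worker], blocks)
        caches[worker].update(blocks)
    return hits


# Requests as lists of block hashes; three of them share the prefix [1, 2].
requests = [[1, 2, 3], [1, 2, 4], [7, 8], [1, 2, 3, 5]]
rr_hits = simulate(requests, num_workers=2, kv_aware=False)
kv_hits = simulate(requests, num_workers=2, kv_aware=True)
print(rr_hits, kv_hits)  # 2 5
```

Round-robin scatters the shared prefix across workers and reuses 2 blocks, while overlap-based routing reuses 5. A production router also has to weigh load, since always chasing the best overlap would pile requests onto one worker, as this toy version does.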
## Cleanup
```bash
# Delete benchmark pods
kubectl delete pod -l app=benchmark -n ${NAMESPACE}
# Delete deployments
kubectl delete dynamographdeployment agg-8xtp2 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-router-6p-2d -n ${NAMESPACE}
```
## References
- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://github.com/kvcache-ai/Mooncake) - FAST25 paper and trace data
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: "your-storage-class-name"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "your-storage-class-name"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "your-storage-class-name"
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 3
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: model-download
    spec:
      restartPolicy: Never
      containers:
        - name: model-download
          image: python:3.10-slim
          command: ["sh", "-c"]
          envFrom:
            - secretRef:
                name: hf-token-secret
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-32B"
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
            - name: MODEL_REVISION
              value: 9216db5781bf21249d130ec9da846c4624c16137
          args:
            - |
              set -eux
              pip install --no-cache-dir huggingface_hub hf_transfer
              hf download $MODEL_NAME --revision $MODEL_REVISION
          volumeMounts:
            - name: model-cache
              mountPath: /home/dynamo/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: agg-8xtp2
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: agg-8xtp2
      envs:
        - name: HF_HOME
          value: /home/dynamo/.cache/huggingface
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace
          command:
            - python3
            - -m
            - dynamo.frontend
          args:
            - --router-reset-states
      replicas: 1
      resources:
        requests:
          cpu: "8"
        limits:
          cpu: "8"
      subComponentType: null
    VllmDecodeWorker:
      componentType: worker
      dynamoNamespace: agg-8xtp2
      envFromSecret: hf-token-secret
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-32B
            - --tensor-parallel-size
            - '2'
            - --disable-log-requests
            - --gpu-memory-utilization
            - '0.90'
            - --async-scheduling
            - --block-size
            - '64'
            - --hf-overrides
            - '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
            - --max-model-len
            - '131072'
          command:
            - python3
            - -m
            - dynamo.vllm
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
          workingDir: /workspace
      replicas: 8
      resources:
        limits:
          gpu: '2'
          custom:
            rdma/ib: "2"
        requests:
          gpu: '2'
      subComponentType: decode
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Pod
metadata:
  name: agg-8xtp2-benchmark
  labels:
    app: benchmark
spec:
  containers:
    - name: python
      image: python:3.11
      command:
        - /bin/bash
        - -lc
        - |
          # Setup
          ulimit -n 1048576
          ulimit -u 65536
          apt update && apt install tmux wget curl jq -y
          # Install benchmarking tool
          pip install aiperf
          # Wait for model to be ready
          echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
          until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
            echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
            sleep 5
          done
          echo "Model '${MODEL_NAME}' is ready!"
          # Download Mooncake conversation trace dataset if not already present
          mkdir -p ${BASE_DIR}/traces
          mkdir -p ${BASE_DIR}/artifacts
          if [ ! -f ${BASE_DIR}/traces/conversation_trace.jsonl ]; then
            wget -qO ${BASE_DIR}/traces/conversation_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/FAST25-release/traces/conversation_trace.jsonl
          fi
          # Setup Paths and Endpoints
          export INPUT_FILE="${BASE_DIR}/traces/conversation_trace.jsonl"
          export MODEL_BASE_NAME="${MODEL_NAME##*/}"
          export FRONTEND_LIB="${FRONTEND%-frontend}"
          export ARTIFACT_DIR="${BASE_DIR}/artifacts/${MODEL_BASE_NAME}_${FRONTEND_LIB}"
          mkdir -p "${ARTIFACT_DIR}"
          # Run the benchmark in tmux so it's easy to attach and watch
          tmux new-session -d -s benchmark -c "${ARTIFACT_DIR}"
          tmux send-keys -t benchmark "aiperf profile -m ${MODEL_NAME} --input-file ${INPUT_FILE} --custom-dataset-type mooncake_trace --fixed-schedule --url http://${FRONTEND}:8000 --streaming --artifact-dir ${ARTIFACT_DIR} --goodput \"time_to_first_token:2000 inter_token_latency:25\"" C-m
          sleep 7200
      env:
        - name: MODEL_NAME
          value: Qwen/Qwen3-32B
        - name: FRONTEND
          value: agg-8xtp2-frontend
        - name: BASE_DIR
          value: /perf-cache
      resources:
        requests:
          cpu: "8"
          memory: 16Gi
        limits:
          cpu: "16"
          memory: 32Gi
      volumeMounts:
        - name: model-cache
          mountPath: /home/dynamo/.cache/huggingface
        - name: perf-cache
          mountPath: /perf-cache
      workingDir: /workspace
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache
    - name: perf-cache
      persistentVolumeClaim:
        claimName: perf-cache
  restartPolicy: Never
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: disagg-router-6p-2d
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: disagg-router-6p-2d
      envs:
        - name: HF_HOME
          value: /home/dynamo/.cache/huggingface
      extraPodSpec:
        mainContainer:
          args:
            - --router-mode
            - kv
            - --router-reset-states
          command:
            - python
            - -m
            - dynamo.frontend
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace
      replicas: 1
      resources:
        requests:
          cpu: "8"
        limits:
          cpu: "8"
      subComponentType: null
    VllmDecodeWorker:
      componentType: worker
      dynamoNamespace: disagg-router-6p-2d
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-32B
            - --tensor-parallel-size
            - '2'
            - --disable-log-requests
            - --gpu-memory-utilization
            - '0.90'
            - --no-enable-prefix-caching
            - --async-scheduling
            - --block-size
            - '64'
            - --hf-overrides
            - '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
            - --max-model-len
            - '131072'
          command:
            - python3
            - -m
            - dynamo.vllm
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
      replicas: 2
      resources:
        limits:
          gpu: '2'
          custom:
            rdma/ib: "2"
        requests:
          gpu: '2'
      subComponentType: decode
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
    VllmPrefillWorker:
      componentType: worker
      dynamoNamespace: disagg-router-6p-2d
      envFromSecret: hf-token-secret
      extraPodMetadata:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "9400"
          prometheus.io/path: "/metrics"
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-32B
            - --is-prefill-worker
            - --tensor-parallel-size
            - '2'
            - --disable-log-requests
            - --gpu-memory-utilization
            - '0.90'
            - --async-scheduling
            - --block-size
            - '64'
            - --hf-overrides
            - '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}'
            - --max-model-len
            - '131072'
          command:
            - python3
            - -m
            - dynamo.vllm
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          env:
            - name: DYN_HEALTH_CHECK_ENABLED
              value: "false"
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
          workingDir: /workspace
      replicas: 6
      resources:
        limits:
          gpu: '2'
          custom:
            rdma/ib: "2"
        requests:
          gpu: '2'
      subComponentType: prefill
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: Pod
metadata:
  name: disagg-router-6p-2d-benchmark
  labels:
    app: benchmark
spec:
  containers:
    - name: python
      image: python:3.11
      command:
        - /bin/bash
        - -lc
        - |
          # Setup
          ulimit -n 1048576
          ulimit -u 65536
          apt update && apt install tmux wget curl jq -y
          # Install benchmarking tool
          pip install aiperf
          # Wait for model to be ready
          echo "Waiting for model '${MODEL_NAME}' at http://${FRONTEND}:8000/v1/models..."
          until curl -s "http://${FRONTEND}:8000/v1/models" | jq -e --arg model "${MODEL_NAME}" '.data[]? | select(.id == $model)' >/dev/null 2>&1; do
            echo "[$(date '+%H:%M:%S')] Model not ready, retrying in 5s..."
            sleep 5
          done
          echo "Model '${MODEL_NAME}' is ready!"
          # Download Mooncake conversation trace dataset if not already present
          mkdir -p ${BASE_DIR}/traces
          mkdir -p ${BASE_DIR}/artifacts
          if [ ! -f ${BASE_DIR}/traces/conversation_trace.jsonl ]; then
            wget -qO ${BASE_DIR}/traces/conversation_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/FAST25-release/traces/conversation_trace.jsonl
          fi
          # Setup Paths and Endpoints
          export INPUT_FILE="${BASE_DIR}/traces/conversation_trace.jsonl"
          export MODEL_BASE_NAME="${MODEL_NAME##*/}"
          export FRONTEND_LIB="${FRONTEND%-frontend}"
          export ARTIFACT_DIR="${BASE_DIR}/artifacts/${MODEL_BASE_NAME}_${FRONTEND_LIB}"
          mkdir -p "${ARTIFACT_DIR}"
          # Run the benchmark in tmux so it's easy to attach and watch
          tmux new-session -d -s benchmark -c "${ARTIFACT_DIR}"
          tmux send-keys -t benchmark "aiperf profile -m ${MODEL_NAME} --input-file ${INPUT_FILE} --custom-dataset-type mooncake_trace --fixed-schedule --url http://${FRONTEND}:8000 --streaming --artifact-dir ${ARTIFACT_DIR} --goodput \"time_to_first_token:2000 inter_token_latency:25\"" C-m
          sleep 7200
      env:
        - name: MODEL_NAME
          value: Qwen/Qwen3-32B
        - name: FRONTEND
          value: disagg-router-6p-2d-frontend
        - name: BASE_DIR
          value: /perf-cache
      resources:
        requests:
          cpu: "8"
          memory: 16Gi
        limits:
          cpu: "16"
          memory: 32Gi
      volumeMounts:
        - name: model-cache
          mountPath: /home/dynamo/.cache/huggingface
        - name: perf-cache
          mountPath: /perf-cache
      workingDir: /workspace
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache
    - name: perf-cache
      persistentVolumeClaim:
        claimName: perf-cache
  restartPolicy: Never